From wesmckinn at gmail.com Sat Jun 1 02:31:18 2013
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 31 May 2013 17:31:18 -0700
Subject: [Pandas-dev] mailing list

On Fri, May 31, 2013 at 2:18 PM, Phillip Cloud wrote:
> hey guys, is it possible we could move this list to google groups?

I'm not opposed to the idea. Anyone? Bueller?

From yoval at gmx.com Sat Jun 1 03:19:32 2013
From: yoval at gmx.com (yoval p.)
Date: Sat, 01 Jun 2013 04:19:32 +0300
Subject: [Pandas-dev] mailing list
Message-ID: <51A94C24.5000404@gmx.com>

Annoying interface, mandatory login/google account, and tracking galore.
-1 here.

From wesmckinn at gmail.com Sat Jun 1 03:25:11 2013
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 31 May 2013 18:25:11 -0700
Subject: [Pandas-dev] mailing list

On Fri, May 31, 2013 at 6:19 PM, yoval p. wrote:
> Annoying interface, mandatory login/google account, and tracking galore.
> -1 here.

I'd prefer to stay here on mailman myself, after some contemplation; the
pydata mailing list is for users and general questions and should be
searchable.

- Wes

From cpcloud at gmail.com Sat Jun 1 03:28:59 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Fri, 31 May 2013 21:28:59 -0400
Subject: [Pandas-dev] mailing list

ok, i don't feel that strongly about it. just thought that the interface
would be nicer, but it's not a big deal to just use email either.

--
Best,
Phillip Cloud

On Fri, May 31, 2013 at 9:25 PM, Wes McKinney wrote:
> I'd prefer to stay here on mailman myself, after some contemplation.

From wesmckinn at gmail.com Sat Jun 1 03:52:20 2013
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 31 May 2013 18:52:20 -0700
Subject: [Pandas-dev] 0.11.1 beta plan

hi guys,

I'll plan to sort out remaining issues and cut a 0.11.1 release
candidate this weekend. Any objections?

Thanks

From jeffreback at gmail.com Sat Jun 1 06:47:19 2013
From: jeffreback at gmail.com (Jeff Reback)
Date: Sat, 1 Jun 2013 00:47:19 -0400
Subject: [Pandas-dev] 0.11.1 beta plan

ok by me

just ran perf bench vs 0.11 - things look good - nothing major

@y-p is BUILD_CACHE_DIR still enabled?

    test_perf -b base -t current

is not creating any cached builds - or am I doing something wrong?

From yoval at gmx.com Sat Jun 1 18:59:15 2013
From: yoval at gmx.com (yoval p.)
Date: Sat, 01 Jun 2013 19:59:15 +0300
Subject: [Pandas-dev] 0.11.1 beta plan
Message-ID: <51AA2863.7040409@gmx.com>

The build cache was taken out of setup.py and placed in
scripts/use_build_cache.py, where it belongs. By invoking that script,
you monkey-patch setup.py with the build cache logic. I did that so I
could use the build cache with historical commits from before the build
cache was introduced, to simplify setup.py, and to make sure work on the
build cache could not break setup.py no matter what.

When used with -b -t (rather than -H), test_perf delegates everything to
the vbench module, which doesn't know anything about the build cache. So
you will get cache action for commits from the period where that logic
was integral to setup.py, but not for commits before or after.

I would suggest that the whole thing is now broken up into 3 decoupled
parts, each independently useful:

1) Use the build cache to quickly jump to any point in history (that
   you've built before).
2) Use test_perf -H to benchmark HEAD, and save the results to a pickle
   file.
3) Use (or reuse previously saved) the pickle files to generate a report.

3 is what's currently missing, but it's practically all there in
test_perf, which just needs some refactoring to support that. Once
that's there, it would be a trivial shell script to automate the whole
thing. Maybe I'll put something together over the weekend.

yoval

P.S. 'Thunderbirds are go' for 0.11.1
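A minimal sketch of the missing part 3 above, under stated assumptions:
that each test_perf -H run pickles a DataFrame of benchmark timings, and
that the column is named 'timing'. The file names are placeholders, not
the actual test_perf output names, and this is not the test_perf code
itself.

    import pickle

    import pandas as pd

    # Hypothetical pickles from `test_perf -H` runs on two commits
    with open('vb_base.pickle', 'rb') as f:
        base = pickle.load(f)
    with open('vb_target.pickle', 'rb') as f:
        target = pickle.load(f)

    # Align the two runs on benchmark name and compare timings
    report = pd.DataFrame({'base': base['timing'],
                           'target': target['timing']})
    report['ratio'] = report['target'] / report['base']

    # Flag anything more than 25% slower on the target commit
    print(report[report['ratio'] > 1.25])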
From cpcloud at gmail.com Mon Jun 3 00:21:41 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Sun, 2 Jun 2013 18:21:41 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

This is the reply I got from the lxml people about an "incorrect" parse
of the failed bank list page. It wasn't actually an incorrect parse: the
page has invalid markup, and lxml makes no promises about that. Moral of
the story: only use html5lib when parsing HTML tables. Should I reopen
the lxml functionality then, with a big honking error in the
documentation telling users to tidy up the HTML they want to parse if
they want to use lxml, or just scrap the lxml functionality entirely? No
need to clutter up the codebase.

--
Best,
Phillip Cloud

---------- Forwarded message ----------
From: scoder <1181905 at bugs.launchpad.net>
Date: Sun, Jun 2, 2013 at 2:14 AM
Subject: [Bug 1181905] Re: tr elements are not parsed correctly
To: cpcloud at gmail.com

The HTML page doesn't validate; even my browser shows me an HTML error.
The <tr> tag you are looking for is not inside of a <table> tag, so it's
actually correct that the last two tests in your script fail, because
they are looking for something that's not there.

If you think that the parser in libxml2 should be able to fix this HTML
error automatically, rather than just parsing through it, please file a
bug report for the libxml2 project. Alternatively, adapt your script to
the broken HTML or use an HTML tidying tool to fix the markup.

** Changed in: lxml
   Status: New => Invalid

--
You received this bug notification because you are subscribed to the bug
report.
https://bugs.launchpad.net/bugs/1181905

Title:
  tr elements are not parsed correctly

Status in lxml - the Python XML toolkit:
  Invalid

Bug description:
  Python           : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
  lxml.etree       : (3, 2, 1, 0)
  libxml used      : (2, 9, 1)
  libxml compiled  : (2, 9, 1)
  libxslt used     : (1, 1, 28)
  libxslt compiled : (1, 1, 28)

  See the attached script. The url
  http://www.fdic.gov/bank/individual/failed/banklist.html is not parsed
  correctly by lxml: the <tr> element containing 'Gold Canyon' is just
  left out, while all of the other <tr> elements seem to be there.

To manage notifications about this bug go to:
https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions
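For contrast, a short sketch of the html5lib route (not code from the
thread; it assumes html5lib is installed alongside bs4): html5lib
repairs invalid markup the way browsers do, so the stray <tr> survives
instead of being dropped.

    from urllib2 import urlopen  # Python 2, as used elsewhere in the thread

    from bs4 import BeautifulSoup

    url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
    soup = BeautifulSoup(urlopen(url).read(), 'html5lib')

    # html5lib re-parents the stray <tr> into a table instead of discarding it
    print(any('Gold Canyon' in t.text for t in soup.find_all('table')))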
From wesmckinn at gmail.com Mon Jun 3 03:19:41 2013
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 2 Jun 2013 18:19:41 -0700
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud wrote:
> Should I reopen the lxml functionality then, with a big honking error
> in the documentation telling users to tidy up the HTML they want to
> parse if they want to use lxml, or just scrap the lxml functionality
> entirely?

Test suite fails with bs4 4.2.1 and latest lxml with libxml2 2.9.0.
Wasted a lot of time already on this today, so the release candidate is
going to have to wait until this is sorted out and passing cleanly.
From cpcloud at gmail.com Mon Jun 3 03:31:52 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Sun, 2 Jun 2013 21:31:52 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

That is strange. Can you give me the gist of what the traceback is? I'm
using the same setup, except my libxml2 is 2.9.1, but that shouldn't
matter. I vote to get rid of the lxml functionality, since it's not
going to parse invalid html, which is what most of the web consists of.

--
Best,
Phillip Cloud

From wesmckinn at gmail.com Mon Jun 3 03:47:01 2013
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 2 Jun 2013 18:47:01 -0700
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud wrote:
> I vote to get rid of the lxml functionality, since it's not going to
> parse invalid html, which is what most of the web consists of.

Perhaps it should attempt lxml and fall back on BS? When lxml succeeds
it is much faster.
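A rough sketch of that fallback (a hypothetical helper, not an existing
pandas function): try the fast lxml parse first, and hand the page to
the slower html5lib-backed BeautifulSoup only when lxml finds no tables
or raises on broken markup.

    from urllib2 import urlopen  # Python 2, matching the thread

    from bs4 import BeautifulSoup
    from lxml.html import parse


    def parse_tables(url):
        # Hypothetical helper: fast lxml first, forgiving html5lib second
        try:
            tables = parse(url).xpath('.//table')
            if tables:
                return tables
        except Exception:
            pass  # fall through to the slow-but-forgiving parser
        soup = BeautifulSoup(urlopen(url).read(), 'html5lib')
        return soup.find_all('table')

One caveat: the empty-result check only catches a totally failed parse;
it can't detect lxml quietly dropping a malformed row inside a table it
did find (the 'Gold Canyon' case). The strict-parser idea that comes up
next addresses that more directly.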
From cpcloud at gmail.com Mon Jun 3 03:57:00 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Sun, 2 Jun 2013 21:57:00 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

yeah, that's better than dumping it altogether. you can use a strict
parser that doesn't try to recover broken html. btw, what tests are
breaking? i can't get any of them to break...

--
Best,
Phillip Cloud
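For reference, a sketch of the strict-parser idea using lxml's own API:
passing recover=False makes libxml2 raise on invalid markup instead of
silently repairing or dropping elements, so a bad page fails loudly up
front rather than parsing into a subtly wrong table. Whether this is the
exact mechanism meant above is an interpretation, not a quote.

    from urllib2 import urlopen  # Python 2, matching the thread

    from lxml import etree

    url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
    strict = etree.HTMLParser(recover=False)
    try:
        doc = etree.parse(urlopen(url), strict)
    except etree.XMLSyntaxError as exc:
        # invalid markup: surface the error (or fall back to html5lib here)
        print('strict parse failed: %s' % exc)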
From wesmckinn at gmail.com Mon Jun 3 04:06:56 2013
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 2 Jun 2013 19:06:56 -0700
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud wrote:
> btw, what tests are breaking? i can't get any of them to break...

https://gist.github.com/wesm/5695768

In [3]: import lxml.etree as etree

In [4]: etree.__version__
Out[4]: u'3.2.1'

libxml2 version 2.9.0. i can upgrade if you think it might be that

In [5]: import bs4

In [6]: bs4.__version__
Out[6]: '4.2.1'

From cpcloud at gmail.com Mon Jun 3 04:33:16 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Sun, 2 Jun 2013 22:33:16 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

saw that you fixed the first test. the second is correctly failing,
because the value retrieved is wrong. i replicated your setup sans
libxml2 and nothing fails. travis is passing these tests, so i'm not
sure exactly what the issue is. can you try the following

    from lxml import parse
    url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
    doc = parse(url)
    len(doc.xpath('.//table')) > 0

    from bs4 import BeautifulSoup
    from contextlib import closing
    from urllib2 import urlopen

    with closing(urlopen(url)) as f:
        soup = BeautifulSoup(f.read(), features='lxml')
    len(soup.find_all('table')) > 0

--
Best,
Phillip Cloud
From cpcloud at gmail.com Mon Jun 3 04:34:07 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Sun, 2 Jun 2013 22:34:07 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

Sorry, that should be: from lxml.html import parse

--
Best,
Phillip Cloud

On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud wrote:
> can you try the following
>
>     from lxml import parse

From cpcloud at gmail.com Mon Jun 3 04:37:21 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Sun, 2 Jun 2013 22:37:21 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

here's the gist of the working code: https://gist.github.com/cpcloud/5695835

--
Best,
Phillip Cloud
URL: From cpcloud at gmail.com Mon Jun 3 05:54:34 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Sun, 2 Jun 2013 23:54:34 -0400 Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly In-Reply-To: References: <20130520030553.7389.41669.malonedeb@wampee.canonical.com> <20130602061417.6118.71824.malone@chaenomeles.canonical.com> Message-ID: wes there's an issue with your anaconda installation. run # make sure you're using the right conda environment here it tripped me up the first time pip uninstall lxml pip uninstall beautifulsoup pip uninstall beautifulsoup4 pip install lxml pip install beautifulsoup4 and try again -- Best, Phillip Cloud On Sun, Jun 2, 2013 at 10:37 PM, Phillip Cloud wrote: > here the gist of the working code https://gist.github.com/cpcloud/5695835 > > > -- > Best, > Phillip Cloud > > > On Sun, Jun 2, 2013 at 10:34 PM, Phillip Cloud wrote: > >> Sorry that should be from lxml.html import parse >> >> >> -- >> Best, >> Phillip Cloud >> >> >> On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud wrote: >> >>> saw that u fixed the first test. second is correctly failing because the >>> value retrieved is wrong. i replicated your setup sans libxml2 and nothing >>> fails. travis is passing these tests, so i'm not sure exactly what the >>> issue is. can you try the following >>> >>> from lxml import parse >>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html' >>> doc = parse(url) >>> len(doc.xpath('.//table')) > 0 >>> >>> from bs4 import BeautifulSoup >>> from contextlib import closing >>> from urllib2 import urlopen >>> with contextlib.closing(urllib2.urlopen(url)) as f: >>> soup = BeautifulSoup(f.read(), features='lxml') >>> >>> len(soup.find_all('table')) > 0 >>> >>> >>> -- >>> Best, >>> Phillip Cloud >>> >>> >>> On Sun, Jun 2, 2013 at 10:06 PM, Wes McKinney wrote: >>> >>>> On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud >>>> wrote: >>>> > yeah that's better than dumping it altogether. you can use a strict >>>> parser >>>> > that doesn't try to recover broken html. btw what tests are breaking? >>>> i >>>> > can't get any of them to break... >>>> > >>>> > >>>> > -- >>>> > Best, >>>> > Phillip Cloud >>>> > >>>> > >>>> > On Sun, Jun 2, 2013 at 9:47 PM, Wes McKinney >>>> wrote: >>>> >> >>>> >> On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud >>>> wrote: >>>> >> > That is strange. Can you give me the gist of what the traceback >>>> is? I'm >>>> >> > using the same except my lxml is 2.9.1 but that shouldn't matter. >>>> I vote >>>> >> > to >>>> >> > get rid of the lxml functionality since it's not going to parse >>>> invalid >>>> >> > html, which is what most of the web consists of. >>>> >> > >>>> >> > >>>> >> > -- >>>> >> > Best, >>>> >> > Phillip Cloud >>>> >> > >>>> >> > >>>> >> > On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney >>>> >> > wrote: >>>> >> >> >>>> >> >> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud >>>> >> >> wrote: >>>> >> >> > This is the reply I got from the lxml people about an >>>> "incorrect" >>>> >> >> > parse >>>> >> >> > of >>>> >> >> > the failed bank list page. It wasn't actually an incorrect >>>> parse, the >>>> >> >> > page >>>> >> >> > has invalid markup and lxml makes no promises about that. Moral >>>> of >>>> >> >> > the >>>> >> >> > story: only use html5lib when parsing HTML tables. 
Should I >>>> reopen >>>> >> >> > the >>>> >> >> > lxml >>>> >> >> > functionality then, with a big honking error in the >>>> documentation >>>> >> >> > telling >>>> >> >> > users to tidy up the HTML they want to parse if they want to >>>> use lxml >>>> >> >> > or >>>> >> >> > just scrap the lxml functionality entirely? No need to clutter >>>> up the >>>> >> >> > codebase. >>>> >> >> > >>>> >> >> > -- >>>> >> >> > Best, >>>> >> >> > Phillip Cloud >>>> >> >> > >>>> >> >> > >>>> >> >> > ---------- Forwarded message ---------- >>>> >> >> > From: scoder <1181905 at bugs.launchpad.net> >>>> >> >> > Date: Sun, Jun 2, 2013 at 2:14 AM >>>> >> >> > Subject: [Bug 1181905] Re: tr elements are not parsed correctly >>>> >> >> > To: cpcloud at gmail.com >>>> >> >> > >>>> >> >> > >>>> >> >> > The HTML page doesn't validate, even my browser shows me an HTML >>>> >> >> > error. >>>> >> >> > The tag you are looking for is not inside of a tag, >>>> so it's >>>> >> >> > actually correct that the last two tests in your script fail >>>> because >>>> >> >> > they are looking for something that's not there. >>>> >> >> > >>>> >> >> > If you think that the parser in libxml2 should be able to fix >>>> this >>>> >> >> > HTML >>>> >> >> > error automatically, rather than just parsing through it, >>>> please file >>>> >> >> > a >>>> >> >> > bug report for the libxml2 project. Alternatively, adapt your >>>> script >>>> >> >> > to >>>> >> >> > the broken HTML or use an HTML tidying tool to fix the markup. >>>> >> >> > >>>> >> >> > >>>> >> >> > ** Changed in: lxml >>>> >> >> > Status: New => Invalid >>>> >> >> > >>>> >> >> > -- >>>> >> >> > You received this bug notification because you are subscribed >>>> to the >>>> >> >> > bug >>>> >> >> > report. >>>> >> >> > https://bugs.launchpad.net/bugs/1181905 >>>> >> >> > >>>> >> >> > Title: >>>> >> >> > tr elements are not parsed correctly >>>> >> >> > >>>> >> >> > Status in lxml - the Python XML toolkit: >>>> >> >> > Invalid >>>> >> >> > >>>> >> >> > Bug description: >>>> >> >> > Python : sys.version_info(major=2, minor=7, >>>> micro=5, >>>> >> >> > releaselevel='final', serial=0) >>>> >> >> > lxml.etree : (3, 2, 1, 0) >>>> >> >> > libxml used : (2, 9, 1) >>>> >> >> > libxml compiled : (2, 9, 1) >>>> >> >> > libxslt used : (1, 1, 28) >>>> >> >> > libxslt compiled : (1, 1, 28) >>>> >> >> > >>>> >> >> > See the attached script. The url >>>> >> >> > http://www.fdic.gov/bank/individual/failed/banklist.html is >>>> not >>>> >> >> > parsed >>>> >> >> > correctly by lxml. the element containing 'Gold Canyon' is >>>> just >>>> >> >> > left >>>> >> >> > out, while all of the other elements seem to be there. >>>> >> >> > >>>> >> >> > To manage notifications about this bug go to: >>>> >> >> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions >>>> >> >> > >>>> >> >> > >>>> >> >> > _______________________________________________ >>>> >> >> > Pandas-dev mailing list >>>> >> >> > Pandas-dev at python.org >>>> >> >> > http://mail.python.org/mailman/listinfo/pandas-dev >>>> >> >> > >>>> >> >> >>>> >> >> Test suite fails with bs4 4.2.1 and latest lxml with libxml2 >>>> 2.9.0. >>>> >> >> Wasted a lot of time already on this today so the release >>>> candidate is >>>> >> >> going to have to wait until this is sorted out and passing >>>> cleanly. >>>> >> > >>>> >> > >>>> >> >>>> >> Perhaps it should attempt lxml and fall back on BS? When lxml >>>> succeeds >>>> >> it is much faster. 
>>>> > >>>> > >>>> >>>> https://gist.github.com/wesm/5695768 >>>> >>>> In [3]: import lxml.etree as etree >>>> >>>> In [4]: etree.__version__ >>>> Out[4]: u'3.2.1' >>>> >>>> libxml2 version 2.9.0. i can upgrade if you think it might be that >>>> >>>> In [5]: import bs4 >>>> >>>> In [6]: bs4.__version__ >>>> Out[6]: '4.2.1' >>>> >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cpcloud at gmail.com Mon Jun 3 06:26:15 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Mon, 3 Jun 2013 00:26:15 -0400 Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly In-Reply-To: References: <20130520030553.7389.41669.malonedeb@wampee.canonical.com> <20130602061417.6118.71824.malone@chaenomeles.canonical.com> Message-ID: alright i've spent 2 hours tracking this down and here are the results for anaconda lxml 3.2.1 works but 3.2.0 doesn't. for a regular virtualenv 3.2.0 works fine (so does 3.2.1) travis is passing these tests so i think there's something weird with anaconda's path stuff i'm not sure what the issue is there. could be a path issue somewhere, but frankly this is not worth spending any more time on. should i add something to the docs along the lines of if you're using anaconda and you want lxml, then use version 3.2.1? an additional bug sprang up which is that the tests are run if lxml installed but not bs4 (they should run in this case), this i will fix and submit a pr. -- Best, Phillip Cloud On Sun, Jun 2, 2013 at 11:54 PM, Phillip Cloud wrote: > wes there's an issue with your anaconda installation. run > > # make sure you're using the right conda environment here it tripped me up > the first time > > pip uninstall lxml > pip uninstall beautifulsoup > pip uninstall beautifulsoup4 > pip install lxml > pip install beautifulsoup4 > > and try again > > > -- > Best, > Phillip Cloud > > > On Sun, Jun 2, 2013 at 10:37 PM, Phillip Cloud wrote: > >> here the gist of the working code https://gist.github.com/cpcloud/5695835 >> >> >> -- >> Best, >> Phillip Cloud >> >> >> On Sun, Jun 2, 2013 at 10:34 PM, Phillip Cloud wrote: >> >>> Sorry that should be from lxml.html import parse >>> >>> >>> -- >>> Best, >>> Phillip Cloud >>> >>> >>> On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud wrote: >>> >>>> saw that u fixed the first test. second is correctly failing because >>>> the value retrieved is wrong. i replicated your setup sans libxml2 and >>>> nothing fails. travis is passing these tests, so i'm not sure exactly what >>>> the issue is. can you try the following >>>> >>>> from lxml import parse >>>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html' >>>> doc = parse(url) >>>> len(doc.xpath('.//table')) > 0 >>>> >>>> from bs4 import BeautifulSoup >>>> from contextlib import closing >>>> from urllib2 import urlopen >>>> with contextlib.closing(urllib2.urlopen(url)) as f: >>>> soup = BeautifulSoup(f.read(), features='lxml') >>>> >>>> len(soup.find_all('table')) > 0 >>>> >>>> >>>> -- >>>> Best, >>>> Phillip Cloud >>>> >>>> >>>> On Sun, Jun 2, 2013 at 10:06 PM, Wes McKinney wrote: >>>> >>>>> On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud >>>>> wrote: >>>>> > yeah that's better than dumping it altogether. you can use a strict >>>>> parser >>>>> > that doesn't try to recover broken html. btw what tests are >>>>> breaking? i >>>>> > can't get any of them to break... 
>>>>> > >>>>> > >>>>> > -- >>>>> > Best, >>>>> > Phillip Cloud >>>>> > >>>>> > >>>>> > On Sun, Jun 2, 2013 at 9:47 PM, Wes McKinney >>>>> wrote: >>>>> >> >>>>> >> On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud >>>>> wrote: >>>>> >> > That is strange. Can you give me the gist of what the traceback >>>>> is? I'm >>>>> >> > using the same except my lxml is 2.9.1 but that shouldn't matter. >>>>> I vote >>>>> >> > to >>>>> >> > get rid of the lxml functionality since it's not going to parse >>>>> invalid >>>>> >> > html, which is what most of the web consists of. >>>>> >> > >>>>> >> > >>>>> >> > -- >>>>> >> > Best, >>>>> >> > Phillip Cloud >>>>> >> > >>>>> >> > >>>>> >> > On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney >>>> > >>>>> >> > wrote: >>>>> >> >> >>>>> >> >> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud >>>> > >>>>> >> >> wrote: >>>>> >> >> > This is the reply I got from the lxml people about an >>>>> "incorrect" >>>>> >> >> > parse >>>>> >> >> > of >>>>> >> >> > the failed bank list page. It wasn't actually an incorrect >>>>> parse, the >>>>> >> >> > page >>>>> >> >> > has invalid markup and lxml makes no promises about that. >>>>> Moral of >>>>> >> >> > the >>>>> >> >> > story: only use html5lib when parsing HTML tables. Should I >>>>> reopen >>>>> >> >> > the >>>>> >> >> > lxml >>>>> >> >> > functionality then, with a big honking error in the >>>>> documentation >>>>> >> >> > telling >>>>> >> >> > users to tidy up the HTML they want to parse if they want to >>>>> use lxml >>>>> >> >> > or >>>>> >> >> > just scrap the lxml functionality entirely? No need to clutter >>>>> up the >>>>> >> >> > codebase. >>>>> >> >> > >>>>> >> >> > -- >>>>> >> >> > Best, >>>>> >> >> > Phillip Cloud >>>>> >> >> > >>>>> >> >> > >>>>> >> >> > ---------- Forwarded message ---------- >>>>> >> >> > From: scoder <1181905 at bugs.launchpad.net> >>>>> >> >> > Date: Sun, Jun 2, 2013 at 2:14 AM >>>>> >> >> > Subject: [Bug 1181905] Re: tr elements are not parsed correctly >>>>> >> >> > To: cpcloud at gmail.com >>>>> >> >> > >>>>> >> >> > >>>>> >> >> > The HTML page doesn't validate, even my browser shows me an >>>>> HTML >>>>> >> >> > error. >>>>> >> >> > The tag you are looking for is not inside of a tag, >>>>> so it's >>>>> >> >> > actually correct that the last two tests in your script fail >>>>> because >>>>> >> >> > they are looking for something that's not there. >>>>> >> >> > >>>>> >> >> > If you think that the parser in libxml2 should be able to fix >>>>> this >>>>> >> >> > HTML >>>>> >> >> > error automatically, rather than just parsing through it, >>>>> please file >>>>> >> >> > a >>>>> >> >> > bug report for the libxml2 project. Alternatively, adapt your >>>>> script >>>>> >> >> > to >>>>> >> >> > the broken HTML or use an HTML tidying tool to fix the markup. >>>>> >> >> > >>>>> >> >> > >>>>> >> >> > ** Changed in: lxml >>>>> >> >> > Status: New => Invalid >>>>> >> >> > >>>>> >> >> > -- >>>>> >> >> > You received this bug notification because you are subscribed >>>>> to the >>>>> >> >> > bug >>>>> >> >> > report. 
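The "HTML tidying tool" route suggested in the bug reply can also be done
in-process: let html5lib (via bs4) repair the markup the way a browser
would, then hand the cleaned document to lxml. The round-trip through
unicode() below is just one assumed way to wire the two together.

from contextlib import closing
from urllib2 import urlopen

import lxml.html
from bs4 import BeautifulSoup

url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
with closing(urlopen(url)) as f:
    # html5lib rebuilds the tree the way a browser would, fixing the
    # misplaced <tr> that lxml drops
    soup = BeautifulSoup(f.read(), features='html5lib')

# lxml now sees well-formed markup, so the fast xpath query is safe
doc = lxml.html.fromstring(unicode(soup))
print len(doc.xpath('//table'))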
>>>>> >> >> > https://bugs.launchpad.net/bugs/1181905
>>>>> >> >> > [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cpcloud at gmail.com  Mon Jun  3 07:52:42 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Mon, 3 Jun 2013 01:52:42 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed
	correctly
In-Reply-To:
References: <20130520030553.7389.41669.malonedeb@wampee.canonical.com>
	<20130602061417.6118.71824.malone@chaenomeles.canonical.com>
Message-ID:

ok, i spent another 2 hours on this out of curiosity and frustration, and
because i hate magic like this.

i tried all of these outside of anaconda:
it's not the libxml2 version; i tried 2.8.0, 2.9.0, and 2.9.1
it's not the bs4 version; i tried 4.2.0 and 4.2.1
it's not the lxml version; i tried 3.2.0 and 3.2.1

the only time lxml + bs4 breaks is in anaconda + bs4 + lxml 3.2.0.

there's an issue with the markup too, which i'll update, but again there's
no way to control the validity of other people's markup. the failed bank
list and the python xy plugins tables are both invalid pages, so lxml makes
no promises about them.
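As an aside, the environment details that matter for reports like this can
be dumped in a few lines; the labels below simply mirror the version block
from the launchpad report, and the constants used are standard lxml.etree
attributes.

import sys

import bs4
from lxml import etree

print 'Python          :', sys.version_info
print 'lxml.etree      :', etree.LXML_VERSION
print 'libxml used     :', etree.LIBXML_VERSION
print 'libxml compiled :', etree.LIBXML_COMPILED_VERSION
print 'bs4             :', bs4.__version__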
i will also make the change to allow users the choice of whichever parser
they want to use, but i really think that if lxml raises an XMLSyntaxError
then pandas should NOT try to use html5lib. the user should be made aware of
what they are doing, namely that the page they are trying to parse is
invalid, and that they should explicitly pass flavor='html5lib' if they want
to parse the page. they would have to install html5lib anyway to get the
former behavior.

since most of the web is crap html, i really think there's only a minor
benefit to including a fast parser: most of the time it will just be unable
to parse a page, and thus it will only be fast at determining that it cannot
parse the page. i don't know for sure, but i doubt there are many huge html
tables out there that are contained in valid html. anyway, users can use
html5lib + bs4 themselves to clean the markup and then parse that with lxml
if they are going to store it, but that's of limited use too, since you can
put the data in a format that is easier to parse as soon as it's in the
frame.

wes, i know u have the ultimate say and of course i will go along with
whatever you think is best for pandas, just wanted to give my 2c. i'm happy
to hear other opinions as well

--
Best,
Phillip Cloud


On Mon, Jun 3, 2013 at 12:26 AM, Phillip Cloud wrote:

> alright, i've spent 2 hours tracking this down and here are the results
> [...]
>>>>>> saw that u fixed the first test. the second is correctly failing
>>>>>> because the value retrieved is wrong. i replicated your setup sans
>>>>>> libxml2 and nothing fails. travis is passing these tests, so i'm not
>>>>>> sure exactly what the issue is.
>>>>>> can you try the following
>>>>>>
>>>>>> from lxml import parse
>>>>>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
>>>>>> doc = parse(url)
>>>>>> len(doc.xpath('.//table')) > 0
>>>>>>
>>>>>> from bs4 import BeautifulSoup
>>>>>> from contextlib import closing
>>>>>> from urllib2 import urlopen
>>>>>> with closing(urlopen(url)) as f:
>>>>>>     soup = BeautifulSoup(f.read(), features='lxml')
>>>>>>
>>>>>> len(soup.find_all('table')) > 0
>>>>>> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From wesmckinn at gmail.com  Mon Jun  3 09:40:44 2013
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 3 Jun 2013 00:40:44 -0700
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed
	correctly
In-Reply-To:
References: <20130520030553.7389.41669.malonedeb@wampee.canonical.com>
	<20130602061417.6118.71824.malone@chaenomeles.canonical.com>
Message-ID:

On Sun, Jun 2, 2013 at 10:52 PM, Phillip Cloud wrote:

> ok, i spent another 2 hours on this out of curiosity and frustration, and
> because i hate magic like this.
> [...]

Nuking all the anaconda xml/xslt libs and rebuilding did the trick, thanks
for all the forensic investigation. I'll see about cutting the RC tomorrow.

From jeffreback at gmail.com  Mon Jun  3 13:01:56 2013
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 3 Jun 2013 07:01:56 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed
	correctly
In-Reply-To:
References: <20130520030553.7389.41669.malonedeb@wampee.canonical.com>
	<20130602061417.6118.71824.malone@chaenomeles.canonical.com>
Message-ID:

phillip

might make sense to have a Gotchas section in the docs (after io/HTML/Read
Html) which shows known configurations that work and your conda environment
workaround....
and a short disclaimer on how lxml only deals with properly formatted XML,
while html5lib is more robust....

I can be reached on my cell 917-971-6387

On Jun 3, 2013, at 1:52 AM, Phillip Cloud wrote:

> ok, i spent another 2 hours on this out of curiosity and frustration, and
> because i hate magic like this.
> [...]
> since most of the web is crap html, i really think there's only a minor
> benefit to including a fast parser: most of the time it will just be
> unable to parse a page, and thus it will only be fast at determining that
> it cannot parse the page.
> i don't know for sure, but i doubt there are many huge html tables out
> there that are contained in valid html.
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cpcloud at gmail.com  Mon Jun  3 14:46:42 2013
From: cpcloud at gmail.com (Phillip Cloud)
Date: Mon, 3 Jun 2013 08:46:42 -0400
Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed
	correctly
In-Reply-To:
References: <20130520030553.7389.41669.malonedeb@wampee.canonical.com>
	<20130602061417.6118.71824.malone@chaenomeles.canonical.com>
Message-ID:

i think maybe before the read html section would be good. that ok? i'll
include an example of lxml failing in the docs, and i'll open up the lxml
functionality again.

what's the consensus on what to do on a failed parse? should pandas:
throw an error reminding the user that they have invalid markup and that
they should pass html5lib, and subsequently bail out? or, if html5lib is
installed, try that, and if it's not, bail out with a nice error message?

+1 for the former from me, since passing lxml will just lead to using
html5lib in the vast majority of cases.
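In code, the two options would read roughly like the sketch below;
read_tables, the strict parser choice, and the fallback flag are assumed
stand-ins for illustration (flavor='html5lib' is the actual read_html
keyword under discussion), not the eventual pandas implementation.

from contextlib import closing
from urllib2 import urlopen

import lxml.html
from lxml.etree import XMLSyntaxError
from bs4 import BeautifulSoup


def read_tables(url, fallback=False):
    # option 1 (the "former"): fail loudly and make the user opt in
    # option 2 (fallback=True): quietly retry with the forgiving parser
    strict = lxml.html.HTMLParser(recover=False)
    try:
        return lxml.html.parse(url, parser=strict).xpath('//table')
    except XMLSyntaxError:
        if not fallback:
            raise ValueError("invalid markup: pass flavor='html5lib' "
                             "to parse this page anyway")
        with closing(urlopen(url)) as f:
            soup = BeautifulSoup(f.read(), features='html5lib')
        return soup.find_all('table')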
-- Best, Phillip Cloud On Mon, Jun 3, 2013 at 7:01 AM, Jeff Reback wrote: > phillip > > > might make sense to have a Gotchas section in the docs (after io/HTML/Read > Html) > which shows known configurations that work and your conda environment > workaround.... > and a short disclaimer on how lxml only deals with properly format XML, > while html5lib is more robust.... > > I can be reached on my cell 917-971-6387 > > On Jun 3, 2013, at 1:52 AM, Phillip Cloud wrote: > > ok i spent another 2 hours on this out of curiosity and frustration and > because i hate magic like this. > > i tried all of these outside of anaconda > it's not the libxml2 version, i tried 2.8.0, 2.9.0, and 2.9.1 > it's not the bs4 version, i tried 4.2.0 and 4.2.1 > it's not the lxml version, i tried 3.2.0 and 3.2.1 > > the only time lxml + bs4 breaks is in anaconda + bs4 + lxml 3.2.0 > > there's an issue with the markup too, i'll update it but again there's no > way to control the validity of other people's markup. the failed ban klist > and the python xy plugins tables are both invalid pages so there are no > promises for lxml. i will also make the change to allow users the choice of > whichever they want to use, but i really think if lxml raises an > XMLSyntaxError then pandas should NOT try to use html5lib, the user should > be made aware of what they are doing, namely that the page they are trying > to parse is invalid and that they should explicitly pass flavor='html5lib' > if they want to parse the page. they would have to install html5lib anyway > to get the former behavior. > > since most of the web is crap html i really think there's a minor benefit > to including a fast parser when most of the time it will just be unable to > parse a page and thus it will be fast at determining that it cannot parse > the page. i don't know for sure but i doubt there are many huge html tables > out there that are contained in valid html. anyway users can use html5lib + > bs4 themselves to clean the markup and parse that with lxml if they are > going to store it, but that's useless too since you can put it in a format > that is easier to parse as soon as it's in the frame > > wes, i know u have the ultimate say and of course i will go along with > whatever you think is best for pandas, just wanted to give my 2c. i'm happy > to hear other opinions as well > -- > Best, > Phillip Cloud > > > On Mon, Jun 3, 2013 at 12:26 AM, Phillip Cloud wrote: > >> alright i've spent 2 hours tracking this down and here are the results >> >> for anaconda lxml 3.2.1 works but 3.2.0 doesn't. >> for a regular virtualenv 3.2.0 works fine (so does 3.2.1) >> travis is passing these tests so i think there's something weird with >> anaconda's path stuff >> >> i'm not sure what the issue is there. could be a path issue somewhere, >> but frankly this is not worth spending any more time on. >> >> should i add something to the docs along the lines of if you're using >> anaconda and you want lxml, then use version 3.2.1? >> >> an additional bug sprang up which is that the tests are run if lxml >> installed but not bs4 (they should run in this case), this i will fix and >> submit a pr. >> >> -- >> Best, >> Phillip Cloud >> >> >> On Sun, Jun 2, 2013 at 11:54 PM, Phillip Cloud wrote: >> >>> wes there's an issue with your anaconda installation. 
run >>> >>> # make sure you're using the right conda environment here it tripped me >>> up the first time >>> >>> pip uninstall lxml >>> pip uninstall beautifulsoup >>> pip uninstall beautifulsoup4 >>> pip install lxml >>> pip install beautifulsoup4 >>> >>> and try again >>> >>> >>> -- >>> Best, >>> Phillip Cloud >>> >>> >>> On Sun, Jun 2, 2013 at 10:37 PM, Phillip Cloud wrote: >>> >>>> here the gist of the working code >>>> https://gist.github.com/cpcloud/5695835 >>>> >>>> >>>> -- >>>> Best, >>>> Phillip Cloud >>>> >>>> >>>> On Sun, Jun 2, 2013 at 10:34 PM, Phillip Cloud wrote: >>>> >>>>> Sorry that should be from lxml.html import parse >>>>> >>>>> >>>>> -- >>>>> Best, >>>>> Phillip Cloud >>>>> >>>>> >>>>> On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud wrote: >>>>> >>>>>> saw that u fixed the first test. second is correctly failing because >>>>>> the value retrieved is wrong. i replicated your setup sans libxml2 and >>>>>> nothing fails. travis is passing these tests, so i'm not sure exactly what >>>>>> the issue is. can you try the following >>>>>> >>>>>> from lxml import parse >>>>>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html' >>>>>> doc = parse(url) >>>>>> len(doc.xpath('.//table')) > 0 >>>>>> >>>>>> from bs4 import BeautifulSoup >>>>>> from contextlib import closing >>>>>> from urllib2 import urlopen >>>>>> with contextlib.closing(urllib2.urlopen(url)) as f: >>>>>> soup = BeautifulSoup(f.read(), features='lxml') >>>>>> >>>>>> len(soup.find_all('table')) > 0 >>>>>> >>>>>> >>>>>> -- >>>>>> Best, >>>>>> Phillip Cloud >>>>>> >>>>>> >>>>>> On Sun, Jun 2, 2013 at 10:06 PM, Wes McKinney wrote: >>>>>> >>>>>>> On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud >>>>>>> wrote: >>>>>>> > yeah that's better than dumping it altogether. you can use a >>>>>>> strict parser >>>>>>> > that doesn't try to recover broken html. btw what tests are >>>>>>> breaking? i >>>>>>> > can't get any of them to break... >>>>>>> > >>>>>>> > >>>>>>> > -- >>>>>>> > Best, >>>>>>> > Phillip Cloud >>>>>>> > >>>>>>> > >>>>>>> > On Sun, Jun 2, 2013 at 9:47 PM, Wes McKinney >>>>>>> wrote: >>>>>>> >> >>>>>>> >> On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud >>>>>>> wrote: >>>>>>> >> > That is strange. Can you give me the gist of what the >>>>>>> traceback is? I'm >>>>>>> >> > using the same except my lxml is 2.9.1 but that shouldn't >>>>>>> matter. I vote >>>>>>> >> > to >>>>>>> >> > get rid of the lxml functionality since it's not going to parse >>>>>>> invalid >>>>>>> >> > html, which is what most of the web consists of. >>>>>>> >> > >>>>>>> >> > >>>>>>> >> > -- >>>>>>> >> > Best, >>>>>>> >> > Phillip Cloud >>>>>>> >> > >>>>>>> >> > >>>>>>> >> > On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney < >>>>>>> wesmckinn at gmail.com> >>>>>>> >> > wrote: >>>>>>> >> >> >>>>>>> >> >> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud < >>>>>>> cpcloud at gmail.com> >>>>>>> >> >> wrote: >>>>>>> >> >> > This is the reply I got from the lxml people about an >>>>>>> "incorrect" >>>>>>> >> >> > parse >>>>>>> >> >> > of >>>>>>> >> >> > the failed bank list page. It wasn't actually an incorrect >>>>>>> parse, the >>>>>>> >> >> > page >>>>>>> >> >> > has invalid markup and lxml makes no promises about that. >>>>>>> Moral of >>>>>>> >> >> > the >>>>>>> >> >> > story: only use html5lib when parsing HTML tables. 
Should I >>>>>>> reopen >>>>>>> >> >> > the >>>>>>> >> >> > lxml >>>>>>> >> >> > functionality then, with a big honking error in the >>>>>>> documentation >>>>>>> >> >> > telling >>>>>>> >> >> > users to tidy up the HTML they want to parse if they want to >>>>>>> use lxml >>>>>>> >> >> > or >>>>>>> >> >> > just scrap the lxml functionality entirely? No need to >>>>>>> clutter up the >>>>>>> >> >> > codebase. >>>>>>> >> >> > >>>>>>> >> >> > -- >>>>>>> >> >> > Best, >>>>>>> >> >> > Phillip Cloud >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> > ---------- Forwarded message ---------- >>>>>>> >> >> > From: scoder <1181905 at bugs.launchpad.net> >>>>>>> >> >> > Date: Sun, Jun 2, 2013 at 2:14 AM >>>>>>> >> >> > Subject: [Bug 1181905] Re: tr elements are not parsed >>>>>>> correctly >>>>>>> >> >> > To: cpcloud at gmail.com >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> > The HTML page doesn't validate, even my browser shows me an >>>>>>> HTML >>>>>>> >> >> > error. >>>>>>> >> >> > The tag you are looking for is not inside of a >>>>>>> tag, so it's >>>>>>> >> >> > actually correct that the last two tests in your script fail >>>>>>> because >>>>>>> >> >> > they are looking for something that's not there. >>>>>>> >> >> > >>>>>>> >> >> > If you think that the parser in libxml2 should be able to >>>>>>> fix this >>>>>>> >> >> > HTML >>>>>>> >> >> > error automatically, rather than just parsing through it, >>>>>>> please file >>>>>>> >> >> > a >>>>>>> >> >> > bug report for the libxml2 project. Alternatively, adapt >>>>>>> your script >>>>>>> >> >> > to >>>>>>> >> >> > the broken HTML or use an HTML tidying tool to fix the >>>>>>> markup. >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> > ** Changed in: lxml >>>>>>> >> >> > Status: New => Invalid >>>>>>> >> >> > >>>>>>> >> >> > -- >>>>>>> >> >> > You received this bug notification because you are >>>>>>> subscribed to the >>>>>>> >> >> > bug >>>>>>> >> >> > report. >>>>>>> >> >> > https://bugs.launchpad.net/bugs/1181905 >>>>>>> >> >> > >>>>>>> >> >> > Title: >>>>>>> >> >> > tr elements are not parsed correctly >>>>>>> >> >> > >>>>>>> >> >> > Status in lxml - the Python XML toolkit: >>>>>>> >> >> > Invalid >>>>>>> >> >> > >>>>>>> >> >> > Bug description: >>>>>>> >> >> > Python : sys.version_info(major=2, minor=7, >>>>>>> micro=5, >>>>>>> >> >> > releaselevel='final', serial=0) >>>>>>> >> >> > lxml.etree : (3, 2, 1, 0) >>>>>>> >> >> > libxml used : (2, 9, 1) >>>>>>> >> >> > libxml compiled : (2, 9, 1) >>>>>>> >> >> > libxslt used : (1, 1, 28) >>>>>>> >> >> > libxslt compiled : (1, 1, 28) >>>>>>> >> >> > >>>>>>> >> >> > See the attached script. The url >>>>>>> >> >> > http://www.fdic.gov/bank/individual/failed/banklist.htmlis not >>>>>>> >> >> > parsed >>>>>>> >> >> > correctly by lxml. the element containing 'Gold Canyon' is >>>>>>> just >>>>>>> >> >> > left >>>>>>> >> >> > out, while all of the other elements seem to be there. >>>>>>> >> >> > >>>>>>> >> >> > To manage notifications about this bug go to: >>>>>>> >> >> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions >>>>>>> >> >> > >>>>>>> >> >> > >>>>>>> >> >> > _______________________________________________ >>>>>>> >> >> > Pandas-dev mailing list >>>>>>> >> >> > Pandas-dev at python.org >>>>>>> >> >> > http://mail.python.org/mailman/listinfo/pandas-dev >>>>>>> >> >> > >>>>>>> >> >> >>>>>>> >> >> Test suite fails with bs4 4.2.1 and latest lxml with libxml2 >>>>>>> 2.9.0. 
>>>>>>> >> >> Wasted a lot of time already on this today so the release >>>>>>> candidate is >>>>>>> >> >> going to have to wait until this is sorted out and passing >>>>>>> cleanly. >>>>>>> >> > >>>>>>> >> > >>>>>>> >> >>>>>>> >> Perhaps it should attempt lxml and fall back on BS? When lxml >>>>>>> succeeds >>>>>>> >> it is much faster. >>>>>>> > >>>>>>> > >>>>>>> >>>>>>> https://gist.github.com/wesm/5695768 >>>>>>> >>>>>>> In [3]: import lxml.etree as etree >>>>>>> >>>>>>> In [4]: etree.__version__ >>>>>>> Out[4]: u'3.2.1' >>>>>>> >>>>>>> libxml2 version 2.9.0. i can upgrade if you think it might be that >>>>>>> >>>>>>> In [5]: import bs4 >>>>>>> >>>>>>> In [6]: bs4.__version__ >>>>>>> Out[6]: '4.2.1' >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Mon Jun 3 15:35:11 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Mon, 3 Jun 2013 09:35:11 -0400 Subject: [Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly In-Reply-To: References: <20130520030553.7389.41669.malonedeb@wampee.canonical.com> <20130602061417.6118.71824.malone@chaenomeles.canonical.com> Message-ID: <166853E6-30EC-4F07-BF0D-27D387B59C78@gmail.com> this sounds reasonable to fallback if u can if not flavor/engine is specified go ahead and put the warnings at the top of the html section On Jun 3, 2013, at 8:46 AM, Phillip Cloud wrote: > i think maybe before the read html section would be good. that ok? i'll include an example of lxml failing in the docs. and i'll open up lxml functionality again. what's the consensus on what to do on a failed parse? should pandas: throw an error reminding the user that they have invalid markup and that they should pass html5lib and subsequently bail out? or if html5lib is installed try that, if it's not bail out with a nice error message. +1 for the former from me since passing lxml will lead to using html5lib in the vast majority of cases. > > > -- > Best, > Phillip Cloud > > > On Mon, Jun 3, 2013 at 7:01 AM, Jeff Reback wrote: >> phillip >> >> >> might make sense to have a Gotchas section in the docs (after io/HTML/Read Html) >> which shows known configurations that work and your conda environment workaround.... >> and a short disclaimer on how lxml only deals with properly format XML, while html5lib is more robust.... >> >> I can be reached on my cell 917-971-6387 >> >> On Jun 3, 2013, at 1:52 AM, Phillip Cloud wrote: >> >>> ok i spent another 2 hours on this out of curiosity and frustration and because i hate magic like this. >>> >>> i tried all of these outside of anaconda >>> it's not the libxml2 version, i tried 2.8.0, 2.9.0, and 2.9.1 >>> it's not the bs4 version, i tried 4.2.0 and 4.2.1 >>> it's not the lxml version, i tried 3.2.0 and 3.2.1 >>> >>> the only time lxml + bs4 breaks is in anaconda + bs4 + lxml 3.2.0 >>> >>> there's an issue with the markup too, i'll update it but again there's no way to control the validity of other people's markup. the failed ban klist and the python xy plugins tables are both invalid pages so there are no promises for lxml. 
i will also make the change to allow users the choice of whichever they want to use, but i really think if lxml raises an XMLSyntaxError then pandas should NOT try to use html5lib, the user should be made aware of what they are doing, namely that the page they are trying to parse is invalid and that they should explicitly pass flavor='html5lib' if they want to parse the page. they would have to install html5lib anyway to get the former behavior. >>> >>> since most of the web is crap html i really think there's a minor benefit to including a fast parser when most of the time it will just be unable to parse a page and thus it will be fast at determining that it cannot parse the page. i don't know for sure but i doubt there are many huge html tables out there that are contained in valid html. anyway users can use html5lib + bs4 themselves to clean the markup and parse that with lxml if they are going to store it, but that's useless too since you can put it in a format that is easier to parse as soon as it's in the frame >>> >>> wes, i know u have the ultimate say and of course i will go along with whatever you think is best for pandas, just wanted to give my 2c. i'm happy to hear other opinions as well >>> -- >>> Best, >>> Phillip Cloud >>> >>> >>> On Mon, Jun 3, 2013 at 12:26 AM, Phillip Cloud wrote: >>>> alright i've spent 2 hours tracking this down and here are the results >>>> >>>> for anaconda lxml 3.2.1 works but 3.2.0 doesn't. >>>> for a regular virtualenv 3.2.0 works fine (so does 3.2.1) >>>> travis is passing these tests so i think there's something weird with anaconda's path stuff >>>> >>>> i'm not sure what the issue is there. could be a path issue somewhere, but frankly this is not worth spending any more time on. >>>> >>>> should i add something to the docs along the lines of if you're using anaconda and you want lxml, then use version 3.2.1? >>>> >>>> an additional bug sprang up which is that the tests are run if lxml installed but not bs4 (they should run in this case), this i will fix and submit a pr. >>>> >>>> -- >>>> Best, >>>> Phillip Cloud >>>> >>>> >>>> On Sun, Jun 2, 2013 at 11:54 PM, Phillip Cloud wrote: >>>>> wes there's an issue with your anaconda installation. run >>>>> >>>>> # make sure you're using the right conda environment here it tripped me up the first time >>>>> >>>>> pip uninstall lxml >>>>> pip uninstall beautifulsoup >>>>> pip uninstall beautifulsoup4 >>>>> pip install lxml >>>>> pip install beautifulsoup4 >>>>> >>>>> and try again >>>>> >>>>> >>>>> -- >>>>> Best, >>>>> Phillip Cloud >>>>> >>>>> >>>>> On Sun, Jun 2, 2013 at 10:37 PM, Phillip Cloud wrote: >>>>>> here the gist of the working code https://gist.github.com/cpcloud/5695835 >>>>>> >>>>>> >>>>>> -- >>>>>> Best, >>>>>> Phillip Cloud >>>>>> >>>>>> >>>>>> On Sun, Jun 2, 2013 at 10:34 PM, Phillip Cloud wrote: >>>>>>> Sorry that should be from lxml.html import parse >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Best, >>>>>>> Phillip Cloud >>>>>>> >>>>>>> >>>>>>> On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud wrote: >>>>>>>> saw that u fixed the first test. second is correctly failing because the value retrieved is wrong. i replicated your setup sans libxml2 and nothing fails. travis is passing these tests, so i'm not sure exactly what the issue is. 
can you try the following >>>>>>>> >>>>>>>> from lxml import parse >>>>>>>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html' >>>>>>>> doc = parse(url) >>>>>>>> len(doc.xpath('.//table')) > 0 >>>>>>>> >>>>>>>> from bs4 import BeautifulSoup >>>>>>>> from contextlib import closing >>>>>>>> from urllib2 import urlopen >>>>>>>> with contextlib.closing(urllib2.urlopen(url)) as f: >>>>>>>> soup = BeautifulSoup(f.read(), features='lxml') >>>>>>>> >>>>>>>> len(soup.find_all('table')) > 0 >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Best, >>>>>>>> Phillip Cloud >>>>>>>> >>>>>>>> >>>>>>>> On Sun, Jun 2, 2013 at 10:06 PM, Wes McKinney wrote: >>>>>>>>> On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud wrote: >>>>>>>>> > yeah that's better than dumping it altogether. you can use a strict parser >>>>>>>>> > that doesn't try to recover broken html. btw what tests are breaking? i >>>>>>>>> > can't get any of them to break... >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > -- >>>>>>>>> > Best, >>>>>>>>> > Phillip Cloud >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > On Sun, Jun 2, 2013 at 9:47 PM, Wes McKinney wrote: >>>>>>>>> >> >>>>>>>>> >> On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud wrote: >>>>>>>>> >> > That is strange. Can you give me the gist of what the traceback is? I'm >>>>>>>>> >> > using the same except my lxml is 2.9.1 but that shouldn't matter. I vote >>>>>>>>> >> > to >>>>>>>>> >> > get rid of the lxml functionality since it's not going to parse invalid >>>>>>>>> >> > html, which is what most of the web consists of. >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> > -- >>>>>>>>> >> > Best, >>>>>>>>> >> > Phillip Cloud >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> > On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney >>>>>>>>> >> > wrote: >>>>>>>>> >> >> >>>>>>>>> >> >> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud >>>>>>>>> >> >> wrote: >>>>>>>>> >> >> > This is the reply I got from the lxml people about an "incorrect" >>>>>>>>> >> >> > parse >>>>>>>>> >> >> > of >>>>>>>>> >> >> > the failed bank list page. It wasn't actually an incorrect parse, the >>>>>>>>> >> >> > page >>>>>>>>> >> >> > has invalid markup and lxml makes no promises about that. Moral of >>>>>>>>> >> >> > the >>>>>>>>> >> >> > story: only use html5lib when parsing HTML tables. Should I reopen >>>>>>>>> >> >> > the >>>>>>>>> >> >> > lxml >>>>>>>>> >> >> > functionality then, with a big honking error in the documentation >>>>>>>>> >> >> > telling >>>>>>>>> >> >> > users to tidy up the HTML they want to parse if they want to use lxml >>>>>>>>> >> >> > or >>>>>>>>> >> >> > just scrap the lxml functionality entirely? No need to clutter up the >>>>>>>>> >> >> > codebase. >>>>>>>>> >> >> > >>>>>>>>> >> >> > -- >>>>>>>>> >> >> > Best, >>>>>>>>> >> >> > Phillip Cloud >>>>>>>>> >> >> > >>>>>>>>> >> >> > >>>>>>>>> >> >> > ---------- Forwarded message ---------- >>>>>>>>> >> >> > From: scoder <1181905 at bugs.launchpad.net> >>>>>>>>> >> >> > Date: Sun, Jun 2, 2013 at 2:14 AM >>>>>>>>> >> >> > Subject: [Bug 1181905] Re: tr elements are not parsed correctly >>>>>>>>> >> >> > To: cpcloud at gmail.com >>>>>>>>> >> >> > >>>>>>>>> >> >> > >>>>>>>>> >> >> > The HTML page doesn't validate, even my browser shows me an HTML >>>>>>>>> >> >> > error. >>>>>>>>> >> >> > The tag you are looking for is not inside of a tag, so it's >>>>>>>>> >> >> > actually correct that the last two tests in your script fail because >>>>>>>>> >> >> > they are looking for something that's not there. 
>>>>>>>>> >> >> > >>>>>>>>> >> >> > If you think that the parser in libxml2 should be able to fix this >>>>>>>>> >> >> > HTML >>>>>>>>> >> >> > error automatically, rather than just parsing through it, please file >>>>>>>>> >> >> > a >>>>>>>>> >> >> > bug report for the libxml2 project. Alternatively, adapt your script >>>>>>>>> >> >> > to >>>>>>>>> >> >> > the broken HTML or use an HTML tidying tool to fix the markup. >>>>>>>>> >> >> > >>>>>>>>> >> >> > >>>>>>>>> >> >> > ** Changed in: lxml >>>>>>>>> >> >> > Status: New => Invalid >>>>>>>>> >> >> > >>>>>>>>> >> >> > -- >>>>>>>>> >> >> > You received this bug notification because you are subscribed to the >>>>>>>>> >> >> > bug >>>>>>>>> >> >> > report. >>>>>>>>> >> >> > https://bugs.launchpad.net/bugs/1181905 >>>>>>>>> >> >> > >>>>>>>>> >> >> > Title: >>>>>>>>> >> >> > tr elements are not parsed correctly >>>>>>>>> >> >> > >>>>>>>>> >> >> > Status in lxml - the Python XML toolkit: >>>>>>>>> >> >> > Invalid >>>>>>>>> >> >> > >>>>>>>>> >> >> > Bug description: >>>>>>>>> >> >> > Python : sys.version_info(major=2, minor=7, micro=5, >>>>>>>>> >> >> > releaselevel='final', serial=0) >>>>>>>>> >> >> > lxml.etree : (3, 2, 1, 0) >>>>>>>>> >> >> > libxml used : (2, 9, 1) >>>>>>>>> >> >> > libxml compiled : (2, 9, 1) >>>>>>>>> >> >> > libxslt used : (1, 1, 28) >>>>>>>>> >> >> > libxslt compiled : (1, 1, 28) >>>>>>>>> >> >> > >>>>>>>>> >> >> > See the attached script. The url >>>>>>>>> >> >> > http://www.fdic.gov/bank/individual/failed/banklist.html is not >>>>>>>>> >> >> > parsed >>>>>>>>> >> >> > correctly by lxml. the <tr> element containing 'Gold Canyon' is just >>>>>>>>> >> >> > left >>>>>>>>> >> >> > out, while all of the other <tr> elements seem to be there. >>>>>>>>> >> >> > >>>>>>>>> >> >> > To manage notifications about this bug go to: >>>>>>>>> >> >> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions >>>>>>>>> >> >> > >>>>>>>>> >> >> > >>>>>>>>> >> >> > _______________________________________________ >>>>>>>>> >> >> > Pandas-dev mailing list >>>>>>>>> >> >> > Pandas-dev at python.org >>>>>>>>> >> >> > http://mail.python.org/mailman/listinfo/pandas-dev >>>>>>>>> >> >> > >>>>>>>>> >> >> >>>>>>>>> >> >> Test suite fails with bs4 4.2.1 and latest lxml with libxml2 2.9.0. >>>>>>>>> >> >> Wasted a lot of time already on this today so the release candidate is >>>>>>>>> >> >> going to have to wait until this is sorted out and passing cleanly. >>>>>>>>> >> > >>>>>>>>> >> > >>>>>>>>> >> >>>>>>>>> >> Perhaps it should attempt lxml and fall back on BS? When lxml succeeds >>>>>>>>> >> it is much faster. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> >>>>>>>>> https://gist.github.com/wesm/5695768 >>>>>>>>> >>>>>>>>> In [3]: import lxml.etree as etree >>>>>>>>> >>>>>>>>> In [4]: etree.__version__ >>>>>>>>> Out[4]: u'3.2.1' >>>>>>>>> >>>>>>>>> libxml2 version 2.9.0. i can upgrade if you think it might be that >>>>>>>>> >>>>>>>>> In [5]: import bs4 >>>>>>>>> >>>>>>>>> In [6]: bs4.__version__ >>>>>>>>> Out[6]: '4.2.1' >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> http://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cpcloud at gmail.com Thu Jun 6 18:52:20 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Thu, 6 Jun 2013 12:52:20 -0400 Subject: [Pandas-dev] pydata membership Message-ID: Hey guys, I was wondering how I can become a member of the pydata organization on GitHub.
Jeff and I had a 2-second conversation about this on GitHub, but I wasn't sure what the next steps were, so I thought I would bring it up on the mailing list. -- Best, Phillip Cloud -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Thu Jun 6 19:56:29 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Thu, 6 Jun 2013 10:56:29 -0700 Subject: [Pandas-dev] pydata membership In-Reply-To: References: Message-ID: On Thu, Jun 6, 2013 at 9:52 AM, Phillip Cloud wrote: > Hey guys, I was wondering how I can become a member of the pydata > organization on GitHub. Jeff and I had a 2 second conversation about this on > GitHub, but I wasn't sure what the next steps were so I thought I would > bring it up on the mailing list. > > -- > Best, > Phillip Cloud > I'm comfortable adding you. With great power comes great responsibility. > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > From cpcloud at gmail.com Thu Jun 6 20:10:01 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Thu, 6 Jun 2013 14:10:01 -0400 Subject: [Pandas-dev] pydata membership In-Reply-To: References: Message-ID: Thanks Wes. On Jun 6, 2013 1:57 PM, "Wes McKinney" wrote: > On Thu, Jun 6, 2013 at 9:52 AM, Phillip Cloud wrote: > > Hey guys, I was wondering how I can become a member of the pydata > > organization on GitHub. Jeff and I had a 2 second conversation about > this on > > GitHub, but I wasn't sure what the next steps were so I thought I would > > bring it up on the mailing list. > > > > -- > > Best, > > Phillip Cloud > > > > I'm comfortable adding you. With great power comes great responsibility > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > http://mail.python.org/mailman/listinfo/pandas-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jreback at yahoo.com Thu Jun 6 21:58:29 2013 From: jreback at yahoo.com (Jeff Reback) Date: Thu, 6 Jun 2013 15:58:29 -0400 Subject: [Pandas-dev] 0.11.1 Message-ID: congrats to Phillip! all of the html parsing has been merged in and is clean (and fallback is now pretty nice). docs have lots of warnings/instructions so seems ok. other than the single remaining issue about plotting, I think 0.11.1 is ready 2 go From changshe at gmail.com Thu Jun 6 22:03:51 2013 From: changshe at gmail.com (Chang She) Date: Thu, 6 Jun 2013 13:03:51 -0700 Subject: [Pandas-dev] 0.11.1 In-Reply-To: References: Message-ID: i need to include some copyright information for the google oauth library used for google analytics integration. I'm doing that right now On Thu, Jun 6, 2013 at 12:58 PM, Jeff Reback wrote: > congrats to Phillip!
> > all of the html parsing has been merged in and clean > (and fallback is now pretty nice) > docs have lots of warnings/instructions so seems ok > > other than the single remaining issue about plotting I think 0.11.1 ready 2 go > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From jreback at yahoo.com Thu Jun 6 22:11:43 2013 From: jreback at yahoo.com (Jeff Reback) Date: Thu, 6 Jun 2013 16:11:43 -0400 Subject: [Pandas-dev] 0.11.1 In-Reply-To: References: Message-ID: <4B61816D-81D8-4E39-8595-1CD4465CD16E@yahoo.com> I forgot to tag Dan Allen's new filter method on groupby; he's done, just doing some docs. pull #3680 On Jun 6, 2013, at 4:03 PM, Chang She wrote: > i need to include some copyright information for the google oauth > library used for google analytics integration. I'm doing that right > now > > On Thu, Jun 6, 2013 at 12:58 PM, Jeff Reback wrote: >> congrats to Phillip! >> >> all of the html parsing has been merged in and clean >> (and fallback is now pretty nice) >> docs have lots of warnings/instructions so seems ok >> >> other than the single remaining issue about plotting I think 0.11.1 ready 2 go >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev From jreback at yahoo.com Fri Jun 14 02:50:09 2013 From: jreback at yahoo.com (Jeff Reback) Date: Thu, 13 Jun 2013 20:50:09 -0400 Subject: [Pandas-dev] docs Message-ID: Chang, can you install xclip on your doc build machine (I believe you are on Linux)? the new clipboard in docs uses this. thxs, Jeff I can be reached on my cell 917-971-6387 From changshe at gmail.com Fri Jun 14 06:12:20 2013 From: changshe at gmail.com (Chang She) Date: Thu, 13 Jun 2013 21:12:20 -0700 Subject: [Pandas-dev] docs In-Reply-To: References: Message-ID: installed and re-running the docs build again On Thu, Jun 13, 2013 at 9:09 PM, Chang She wrote: > will do thanks > > On Thu, Jun 13, 2013 at 5:50 PM, Jeff Reback wrote: >> Chang >> >> can u install xclip on your doc build machine (I believe u r on Linux) >> >> the new clipboard in docs uses this >> thxs >> Jeff >> >> I can be reached on my cell 917-971-6387 >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev From changshe at gmail.com Fri Jun 14 06:09:27 2013 From: changshe at gmail.com (Chang She) Date: Thu, 13 Jun 2013 21:09:27 -0700 Subject: [Pandas-dev] docs In-Reply-To: References: Message-ID: will do thanks On Thu, Jun 13, 2013 at 5:50 PM, Jeff Reback wrote: > Chang > > can u install xclip on your doc build machine (I believe u r on Linux) > > the new clipboard in docs uses this > thxs > Jeff > > I can be reached on my cell 917-971-6387 > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From wesmckinn at gmail.com Wed Jun 19 01:35:17 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 18 Jun 2013 16:35:17 -0700 Subject: [Pandas-dev] Getting my attention Message-ID: hey all, really excited to see so much rapid development going on in pandas the last few months. you guys are the best! I've figured out that I've been missing a lot of @wesm mentions since I created an email filter as the chattiness of github has increased (see attached plot of smoothed "issue chattiness over time").
Don't be less chatty! I finally fixed my gmail filter to not hide threads that contain an @wesm in them, so I will be a bit more responsive to questions/queries from now on. thanks, Wes -------------- next part -------------- A non-text attachment was scrubbed... Name: chattiness.png Type: image/png Size: 12899 bytes Desc: not available URL: From jeffreback at gmail.com Fri Jun 21 23:37:27 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Fri, 21 Jun 2013 17:37:27 -0400 Subject: [Pandas-dev] docs - clipboard reading not working for some reason? Message-ID: Chang, I have xclip installed (also xsel), but I don't think that's used.... http://pandas.pydata.org/pandas-docs/dev/io.html#clipboard -------------- next part -------------- An HTML attachment was scrubbed... URL: From yoval at gmx.com Sun Jun 23 14:47:25 2013 From: yoval at gmx.com (yoval p.) Date: Sun, 23 Jun 2013 15:47:25 +0300 Subject: [Pandas-dev] Getting more community feedback into the GH issue tracker Message-ID: <51C6EE5D.5080205@gmx.com> tl;dr version: how about adding a pd.report_issue()/pd.feedback() function? *steps on soapbox* 9 times out of 10, GH issues are what brings about code changes in pandas, whether it's bugfixes or features. I'm very proud to be a committer on an OSS project that has a proven track record of being extremely responsive to users in support and in feature enhancements. I think that's what makes pandas such a great OSS project just as much as technical excellence, versatility and plain ol' just feeling good in your hand (behave). How can we get feedback from more users of pandas, specifically those without a GH account, and turn it into GH issues? Experience shows that surprisingly often the issues/features suggested by users are fairly simple to implement, if only we hear about them. Those kinds of changes immediately translate into making our users more productive, which is a large part of what fosters such a fanatical community, which we do love so. Not everyone has a GH account, and while SO is great (+ @hayd, @jreback et al migrate things over as needed) that also covers only a certain kind of pandas user. Is it enough? are we missing out on our silent users? I suspect some (many?) pandas users hit a snag or have an idea for a feature and we never hear of it, to our dismay. We're certainly not anxious about having lots of open issues, so why not? Would integrating some feedback mechanism directly into the library help fix this? It could be a dialog or a web page, launchable directly from the console, that either redirects users to the "new issue" page or lets them just send in some structured text that we can collect somewhere and review. Bad idea? Good idea? is SO all we need? do you have a better idea? So many things in pandas are oneliners. why isn't feedback one as well? yoval From wesmckinn at gmail.com Tue Jun 25 07:32:59 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 24 Jun 2013 22:32:59 -0700 Subject: [Pandas-dev] Upcoming release as 0.12 instead? Message-ID: This release is already getting quite large, and jam-packed with new functionality like pandas.json. The release numbers are indeed fairly arbitrary but there's good psychology around "the release merits an upgrade"; any thoughts?
- Wes From wesmckinn at gmail.com Tue Jun 25 08:50:27 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 24 Jun 2013 23:50:27 -0700 Subject: [Pandas-dev] Getting more community feedback into the GH issue tracker In-Reply-To: <51C6EE5D.5080205@gmx.com> References: <51C6EE5D.5080205@gmx.com> Message-ID: On Sun, Jun 23, 2013 at 5:47 AM, yoval p. wrote: > tl;dr version: how about adding a pd.report_issue()/pd.feedback() function? > > *steps on soapbox* > > 9 times out of 10, GH issues are what brings about code changes > in pandas, whether it's bugfixes or features. I'm very proud to > be a commiter on an OSS that has a proven track record of being > extremely responsive to users in support and in feature enhancements. > I think that's what makes pandas such a great OSS project just as > much as technical excellence, versatility and plain ol' just feeling > good in your hand (behave). > > How can get we get feedback from more users of pandas, specifically > those without a GH account and turn it into GH issues? Experience > shows it's surprisingly often that the issue/features suggested by users > are fairly simple to implement, if only we hear about it. > Those kinds of changes immediately translate into making our users more > productive. Which is a large part of what fosters such a fanatical > community, which we do love so. > > Not everyone has a GH account and while SO is great (+ @hayd, @jreback > et al migrate things over as needed) that also covers only a certain > kind of pandas user. Is it enough? are we missing out on our silent users? > > I suspect some (many?) pandas users hit a snag or have an idea for a > feature and we never hear of it, to our dismay. We're certainly not > anxious about having lots of open issues, so why not? Would integrating > some feedback mechanism directly into the library help fix this? It > could be a dialog or a web page, launchable directly from the console > that either redirects users directly to the "new issue" page, but could > also let them just send in some structured text that we can send somewhere > and review. > > Bad idea? Good idea? is SO all we need? do you have a better idea? > So many things in pandas are oneliners. why isn't feedback one as well? > > yoval > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev SO definitely needs frequent users to catch all the little bugs and problems. The traffic has reached the point that I'm not able to read every question anymore (I'm amazed at you guys' stamina!). Having a built-in feedback/report_issue doesn't seem like a bad idea, I guess the main issue would be when the feedback is longer than a traceback or a couple of lines of text on the command line. FWIW it seems like SO could be way better than it is, user interface-wise and whatnot. It would be interesting if GitHub gave users a better way to ask questions and have the answers collected in some way that facilitates wiki/cookbook-building. There are a ton of nuggets of pandas-wisdom at this point floating around SO-- I'm often like "hey that's awesome" when I see various people's solutions to questions. 
- Wes From jeffreback at gmail.com Tue Jun 25 12:37:56 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 25 Jun 2013 06:37:56 -0400 Subject: [Pandas-dev] Getting more community feedback into the GH issue tracker In-Reply-To: References: <51C6EE5D.5080205@gmx.com> Message-ID: how about start with it just printing a link and message: Thank you are reporting an issue to Pandas. Please copy this link to your browser: link-to-new-issue-on-github and past in your comments and code Could generate a link that makes a new issue and categorizes it to (to feedback) wonder if git has an api for this At least for a start I can be reached on my cell 917-971-6387 On Jun 25, 2013, at 2:50 AM, Wes McKinney wrote: > On Sun, Jun 23, 2013 at 5:47 AM, yoval p. wrote: >> tl;dr version: how about adding a pd.report_issue()/pd.feedback() function? >> >> *steps on soapbox* >> >> 9 times out of 10, GH issues are what brings about code changes >> in pandas, whether it's bugfixes or features. I'm very proud to >> be a commiter on an OSS that has a proven track record of being >> extremely responsive to users in support and in feature enhancements. >> I think that's what makes pandas such a great OSS project just as >> much as technical excellence, versatility and plain ol' just feeling >> good in your hand (behave). >> >> How can get we get feedback from more users of pandas, specifically >> those without a GH account and turn it into GH issues? Experience >> shows it's surprisingly often that the issue/features suggested by users >> are fairly simple to implement, if only we hear about it. >> Those kinds of changes immediately translate into making our users more >> productive. Which is a large part of what fosters such a fanatical >> community, which we do love so. >> >> Not everyone has a GH account and while SO is great (+ @hayd, @jreback >> et al migrate things over as needed) that also covers only a certain >> kind of pandas user. Is it enough? are we missing out on our silent users? >> >> I suspect some (many?) pandas users hit a snag or have an idea for a >> feature and we never hear of it, to our dismay. We're certainly not >> anxious about having lots of open issues, so why not? Would integrating >> some feedback mechanism directly into the library help fix this? It >> could be a dialog or a web page, launchable directly from the console >> that either redirects users directly to the "new issue" page, but could >> also let them just send in some structured text that we can send somewhere >> and review. >> >> Bad idea? Good idea? is SO all we need? do you have a better idea? >> So many things in pandas are oneliners. why isn't feedback one as well? >> >> yoval >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev > > SO definitely needs frequent users to catch all the little bugs and > problems. The traffic has reached the point that I'm not able to read > every question anymore (I'm amazed at you guys' stamina!). > > Having a built-in feedback/report_issue doesn't seem like a bad idea, > I guess the main issue would be when the feedback is longer than a > traceback or a couple of lines of text on the command line. > > FWIW it seems like SO could be way better than it is, user > interface-wise and whatnot. It would be interesting if GitHub gave > users a better way to ask questions and have the answers collected in > some way that facilitates wiki/cookbook-building. 
There are a ton of > nuggets of pandas-wisdom at this point floating around SO-- I'm often > like "hey that's awesome" when I see various people's solutions to > questions. > > - Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From yoval at gmx.com Tue Jun 25 15:16:19 2013 From: yoval at gmx.com (yoval p.) Date: Tue, 25 Jun 2013 16:16:19 +0300 Subject: [Pandas-dev] Upcoming release as 0.12 instead? In-Reply-To: References: Message-ID: <51C99823.8050001@gmx.com> On 06/25/2013 08:32 AM, Wes McKinney wrote: > This release already getting quite large, and jam-packed with new > functionality like pandas.json. The release numbers are indeed fairly > arbitrary but there's good psychology around "the release merits an > upgrade"; any thoughts? > > - Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > Sounds right to me. There was a window for 0.11.1 (fixing the display config misadventure, and to_csv() regressions) but git master is well past a bugfix release now. yp _______________________________________________ Pandas-dev mailing list Pandas-dev at python.org http://mail.python.org/mailman/listinfo/pandas-dev From cpcloud at gmail.com Tue Jun 25 22:07:08 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 25 Jun 2013 16:07:08 -0400 Subject: [Pandas-dev] Upcoming release as 0.12 instead? In-Reply-To: <51C99823.8050001@gmx.com> References: <51C99823.8050001@gmx.com> Message-ID: Should we make a new milestone for things that are 0.12 right now? 0.12.1 or 0.13? -- Best, Phillip Cloud On Tue, Jun 25, 2013 at 9:16 AM, yoval p. wrote: > On 06/25/2013 08:32 AM, Wes McKinney wrote: > > This release already getting quite large, and jam-packed with new > > functionality like pandas.json. The release numbers are indeed fairly > > arbitrary but there's good psychology around "the release merits an > > upgrade"; any thoughts? > > > > - Wes > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > http://mail.python.org/mailman/listinfo/pandas-dev > > > > Sounds right to me. There was a window for 0.11.1 (fixing the display > config misadventure, and to_csv() regressions) but git master as well > past a bugfix release now. > > yp > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Jun 25 22:13:10 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 25 Jun 2013 13:13:10 -0700 Subject: [Pandas-dev] Upcoming release as 0.12 instead? In-Reply-To: References: <51C99823.8050001@gmx.com> Message-ID: On Tue, Jun 25, 2013 at 1:07 PM, Phillip Cloud wrote: > Should we make a new milestone for things that are 0.12 right now? 0.12.1 or > 0.13? > > > -- > Best, > Phillip Cloud > > > On Tue, Jun 25, 2013 at 9:16 AM, yoval p. wrote: >> >> On 06/25/2013 08:32 AM, Wes McKinney wrote: >> > This release already getting quite large, and jam-packed with new >> > functionality like pandas.json. The release numbers are indeed fairly >> > arbitrary but there's good psychology around "the release merits an >> > upgrade"; any thoughts? >> > >> > - Wes >> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > http://mail.python.org/mailman/listinfo/pandas-dev >> > >> >> Sounds right to me.
There was a window for 0.11.1 (fixing the display >> config misadventure, and to_csv() regressions) but git master as well >> past a bugfix release now. >> >> yp >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > Already renamed the milestone to 0.13; the names aren't, um, set in stone =) From cpcloud at gmail.com Tue Jun 25 22:18:37 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 25 Jun 2013 16:18:37 -0400 Subject: [Pandas-dev] Upcoming release as 0.12 instead? In-Reply-To: References: <51C99823.8050001@gmx.com> Message-ID: Great! Thanks. -- Best, Phillip Cloud On Tue, Jun 25, 2013 at 4:13 PM, Wes McKinney wrote: > On Tue, Jun 25, 2013 at 1:07 PM, Phillip Cloud wrote: > > Should we make a new milestone for things that are 0.12 right now? > 0.12.1 or > > 0.13? > > > > > > -- > > Best, > > Phillip Cloud > > > > > > On Tue, Jun 25, 2013 at 9:16 AM, yoval p. wrote: > >> > >> On 06/25/2013 08:32 AM, Wes McKinney wrote: > >> > This release already getting quite large, and jam-packed with new > >> > functionality like pandas.json. The release numbers are indeed fairly > >> > arbitrary but there's good psychology around "the release merits an > >> > upgrade"; any thoughts? > >> > > >> > - Wes > >> > _______________________________________________ > >> > Pandas-dev mailing list > >> > Pandas-dev at python.org > >> > http://mail.python.org/mailman/listinfo/pandas-dev > >> > > >> > >> Sounds right to me. There was a window for 0.11.1 (fixing the display > >> config misadventure, and to_csv() regressions) but git master as well > >> past a bugfix release now. > >> > >> yp > >> _______________________________________________ > >> Pandas-dev mailing list > >> Pandas-dev at python.org > >> http://mail.python.org/mailman/listinfo/pandas-dev > > > > > > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > http://mail.python.org/mailman/listinfo/pandas-dev > > > > Already renamed the milestone to 0.13; the names aren't, um, set in stone > =) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cpcloud at gmail.com Wed Jun 26 02:56:59 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 25 Jun 2013 20:56:59 -0400 Subject: [Pandas-dev] Upcoming release as 0.12 instead? In-Reply-To: References: <51C99823.8050001@gmx.com> Message-ID: Should also change version numbers in release notes and what's new documents. I'll do it. -- Best, Phillip Cloud On Tue, Jun 25, 2013 at 4:18 PM, Phillip Cloud wrote: > Great! Thanks. > > > -- > Best, > Phillip Cloud > > > On Tue, Jun 25, 2013 at 4:13 PM, Wes McKinney wrote: > >> On Tue, Jun 25, 2013 at 1:07 PM, Phillip Cloud wrote: >> > Should we make a new milestone for things that are 0.12 right now? >> 0.12.1 or >> > 0.13? >> > >> > >> > -- >> > Best, >> > Phillip Cloud >> > >> > >> > On Tue, Jun 25, 2013 at 9:16 AM, yoval p. wrote: >> >> >> >> On 06/25/2013 08:32 AM, Wes McKinney wrote: >> >> > This release already getting quite large, and jam-packed with new >> >> > functionality like pandas.json. The release numbers are indeed fairly >> >> > arbitrary but there's good psychology around "the release merits an >> >> > upgrade"; any thoughts? 
>> >> > >> >> > - Wes >> >> > _______________________________________________ >> >> > Pandas-dev mailing list >> >> > Pandas-dev at python.org >> >> > http://mail.python.org/mailman/listinfo/pandas-dev >> >> >> >> Sounds right to me. There was a window for 0.11.1 (fixing the display >> >> config misadventure, and to_csv() regressions) but git master as well >> >> past a bugfix release now. >> >> >> >> yp >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> http://mail.python.org/mailman/listinfo/pandas-dev >> > >> > >> > >> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > http://mail.python.org/mailman/listinfo/pandas-dev >> > >> >> Already renamed the milestone to 0.13; the names aren't, um, set in stone >> =) >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jreback at yahoo.com Wed Jun 26 03:05:04 2013 From: jreback at yahoo.com (Jeff Reback) Date: Tue, 25 Jun 2013 21:05:04 -0400 Subject: [Pandas-dev] can we revive the nightly builds? Message-ID: <74614D7E-5666-4A2C-AAC1-86D9772EA2EF@yahoo.com> https://github.com/pydata/pandas/issues/3777#issuecomment-20010594 From cpcloud at gmail.com Wed Jun 26 03:43:54 2013 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 25 Jun 2013 21:43:54 -0400 Subject: [Pandas-dev] github pulse Message-ID: is there something weird in gh pulse? there's no way that i have 63 commits in the past week. i only have 98 commits overall and git log --oneline --no-merges --author='Phillip Cloud' --since='1 week ago' | wc -l yields 14 just want to make sure nothing fishy is going on... -- Best, Phillip Cloud -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Wed Jun 26 16:21:27 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 26 Jun 2013 07:21:27 -0700 Subject: [Pandas-dev] can we revive the nightly builds? In-Reply-To: <74614D7E-5666-4A2C-AAC1-86D9772EA2EF@yahoo.com> References: <74614D7E-5666-4A2C-AAC1-86D9772EA2EF@yahoo.com> Message-ID: On Tue, Jun 25, 2013 at 6:05 PM, Jeff Reback wrote: > > https://github.com/pydata/pandas/issues/3777#issuecomment-20010594 > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev Chang, can you check on the jenkins box (I'm away at scipy) to see if it's a quick fix? From changshe at gmail.com Wed Jun 26 19:08:30 2013 From: changshe at gmail.com (Chang She) Date: Wed, 26 Jun 2013 10:08:30 -0700 Subject: [Pandas-dev] can we revive the nightly builds? In-Reply-To: References: <74614D7E-5666-4A2C-AAC1-86D9772EA2EF@yahoo.com> Message-ID: low disk space on the C drive caused jenkins to blow up. fixing now On Wed, Jun 26, 2013 at 7:21 AM, Wes McKinney wrote: > On Tue, Jun 25, 2013 at 6:05 PM, Jeff Reback wrote: >> >> https://github.com/pydata/pandas/issues/3777#issuecomment-20010594 >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev > > Chang, can you check on the jenkins box (I'm away at scipy) to see if > it's a quick fix?
> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From wesmckinn at gmail.com Thu Jun 27 23:59:25 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Thu, 27 Jun 2013 14:59:25 -0700 Subject: [Pandas-dev] github pulse In-Reply-To: References: Message-ID: On Tue, Jun 25, 2013 at 6:43 PM, Phillip Cloud wrote: > is there something weird in gh pulse? there's no way that i have 63 commits > in the past week. i only have 98 commits overall and > > git log --oneline --no-merges --author='Phillip Cloud' --since='1 week ago' > | wc -l > > yields 14 > > just want to make nothing fishy is going on... > > -- > Best, > Phillip Cloud > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > Well, you know, GitHub doesn't know how to count higher than 1000, so not that surprised.
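For anyone who wants to sanity-check Pulse against the repository itself, a couple of standard git one-liners give an unambiguous local count (these are plain git options, nothing pandas-specific):

# count non-merge commits by one author over the last week;
# --count replaces the `wc -l` pipeline used earlier in the thread
git rev-list --count --no-merges --author='Phillip Cloud' --since='1 week ago' HEAD

# per-author commit totals for the same window, sorted by count
git shortlog -sn --no-merges --since='1 week ago'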