[Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

Phillip Cloud cpcloud at gmail.com
Mon Jun 3 14:46:42 CEST 2013


i think maybe right before the Read HTML section of the docs would be good.
that ok? i'll include an example of lxml failing in the docs, and i'll open up
the lxml functionality again. what's the consensus on what to do on a failed
parse? should pandas (a) throw an error reminding the user that they have
invalid markup and that they should pass flavor='html5lib', and then bail out,
or (b) if html5lib is installed, try that, and if it's not, bail out with a
nice error message? +1 for the former from me, since with the automatic
fallback, passing flavor='lxml' would lead to using html5lib in the vast
majority of cases anyway.
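
for concreteness, here's a rough sketch of the first option (XMLSyntaxError is
what lxml raises on a failed strict parse; _read_html, _parse_lxml,
_parse_html5lib, and the flavor keyword are illustrative names only, not
actual pandas internals):

from lxml.etree import XMLSyntaxError

def _read_html(io, flavor='lxml'):
    # hypothetical dispatch: fail loudly rather than silently falling back
    if flavor == 'lxml':
        try:
            return _parse_lxml(io)  # fast, strict
        except XMLSyntaxError:
            raise ValueError("%r contains invalid markup; pass "
                             "flavor='html5lib' to parse it anyway" % io)
    elif flavor == 'html5lib':
        return _parse_html5lib(io)  # slow, lenient
    else:
        raise ValueError('unknown flavor %r' % flavor)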


--
Best,
Phillip Cloud


On Mon, Jun 3, 2013 at 7:01 AM, Jeff Reback <jeffreback at gmail.com> wrote:

> phillip
>
>
> might make sense to have a Gotchas section in the docs (after io/HTML/Read Html)
> which shows known configurations that work and your conda environment workaround,
> and a short disclaimer on how lxml only deals with properly formatted XML,
> while html5lib is more robust....
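>
> a tiny example for that disclaimer, using the exact failure mode from the
> bank list page (a td that isn't inside a tr; the snippet is made up):
>
> import lxml.html
> import html5lib
>
> bad = '<table><td>Gold Canyon</td></table>'  # invalid: td outside any tr
>
> # lxml parses through the broken markup without repairing the tree,
> # so the orphaned td is likely not reachable under a tr
> lxml.html.fromstring(bad).xpath('//tr/td/text()')  # likely []
>
> # html5lib recovers the way a browser would, inserting the implied tr
> doc = html5lib.parse(bad, namespaceHTMLElements=False)
> [td.text for td in doc.findall('.//tr/td')]  # expect ['Gold Canyon']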
>
> I can be reached on my cell 917-971-6387
>
> On Jun 3, 2013, at 1:52 AM, Phillip Cloud <cpcloud at gmail.com> wrote:
>
> ok i spent another 2 hours on this out of curiosity and frustration and
> because i hate magic like this.
>
> i tried all of these outside of anaconda:
> it's not the libxml2 version, i tried 2.8.0, 2.9.0, and 2.9.1
> it's not the bs4 version, i tried 4.2.0 and 4.2.1
> it's not the lxml version, i tried 3.2.0 and 3.2.1
>
> the only time lxml + bs4 breaks is in anaconda + bs4 + lxml 3.2.0
>
> there's an issue with the markup too, i'll update it, but again there's no
> way to control the validity of other people's markup. the failed bank list
> and the python xy plugins tables are both invalid pages, so there are no
> promises for lxml. i will also make the change to allow users the choice of
> whichever parser they want to use, but i really think that if lxml raises an
> XMLSyntaxError then pandas should NOT try to use html5lib; the user should
> be made aware of what they are doing, namely that the page they are trying
> to parse is invalid, and that they should explicitly pass flavor='html5lib'
> if they want to parse the page. they would have to install html5lib anyway
> to get the fallback behavior.
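>
> e.g. the user-facing call would then be (assuming the flavor keyword we've
> been discussing):
>
> import pandas as pd
>
> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
> dfs = pd.read_html(url, flavor='html5lib')  # explicit opt-in to the lenient parser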
>
> since most of the web is crap html, i really think there's only a minor
> benefit to including a fast parser, when most of the time it will just be
> unable to parse a page and thus will only be fast at determining that it
> cannot parse the page. i don't know for sure, but i doubt there are many huge
> html tables out there that are contained in valid html. anyway, users can use
> html5lib + bs4 themselves to clean the markup and then parse the result with
> lxml if they are going to store it, but even that is of limited use, since as
> soon as the data is in the frame you can put it in a format that is easier to
> parse.
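>
> for reference, that clean-then-parse round trip would look something like
> this (a sketch; the bad markup below is a stand-in for a fetched page):
>
> from bs4 import BeautifulSoup
> import lxml.html
>
> raw = '<table><td>Gold Canyon</td></table>'  # stand-in for downloaded html
>
> # let html5lib repair the markup the way a browser would...
> soup = BeautifulSoup(raw, features='html5lib')
>
> # ...then hand the cleaned serialization to lxml for fast xpath queries
> tree = lxml.html.fromstring(str(soup))
> tables = tree.xpath('//table')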
>
> wes, i know you have the ultimate say, and of course i will go along with
> whatever you think is best for pandas, just wanted to give my 2c. i'm happy
> to hear other opinions as well
> --
> Best,
> Phillip Cloud
>
>
> On Mon, Jun 3, 2013 at 12:26 AM, Phillip Cloud <cpcloud at gmail.com> wrote:
>
>> alright, i've spent 2 hours tracking this down and here are the results:
>>
>> for anaconda, lxml 3.2.1 works but 3.2.0 doesn't.
>> for a regular virtualenv, 3.2.0 works fine (so does 3.2.1).
>> travis is passing these tests, so i think there's something weird with
>> anaconda's path handling.
>>
>> i'm not sure what the issue is there. it could be a path issue somewhere,
>> but frankly this is not worth spending any more time on.
>>
>> should i add something to the docs along the lines of: if you're using
>> anaconda and you want lxml, then use version 3.2.1?
>>
>> an additional bug sprang up, which is that the tests are not run if lxml is
>> installed but bs4 is not (they should run in this case); this i will fix and
>> submit a pr.
>>
>> --
>> Best,
>> Phillip Cloud
>>
>>
>> On Sun, Jun 2, 2013 at 11:54 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>
>>> wes there's an issue with your anaconda installation. run
>>>
>>> # make sure you're using the right conda environment; it tripped me up
>>> # the first time
>>>
>>> pip uninstall lxml
>>> pip uninstall beautifulsoup
>>> pip uninstall beautifulsoup4
>>> pip install lxml
>>> pip install beautifulsoup4
>>>
>>> and try again
>>>
>>>
>>> --
>>> Best,
>>> Phillip Cloud
>>>
>>>
>>> On Sun, Jun 2, 2013 at 10:37 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>
>>>> here's the gist of the working code:
>>>> https://gist.github.com/cpcloud/5695835
>>>>
>>>>
>>>> --
>>>> Best,
>>>> Phillip Cloud
>>>>
>>>>
>>>> On Sun, Jun 2, 2013 at 10:34 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>>
>>>>> Sorry that should be from lxml.html import parse
>>>>>
>>>>>
>>>>> --
>>>>> Best,
>>>>> Phillip Cloud
>>>>>
>>>>>
>>>>> On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>>>
>>>>>> saw that you fixed the first test. the second is correctly failing because
>>>>>> the value retrieved is wrong. i replicated your setup sans libxml2 and
>>>>>> nothing fails. travis is passing these tests, so i'm not sure exactly what
>>>>>> the issue is. can you try the following:
>>>>>>
>>>>>> from lxml import parse
>>>>>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
>>>>>> doc = parse(url)
>>>>>> len(doc.xpath('.//table')) > 0
>>>>>>
>>>>>> from bs4 import BeautifulSoup
>>>>>> from contextlib import closing
>>>>>> from urllib2 import urlopen
>>>>>> with closing(urlopen(url)) as f:
>>>>>>     soup = BeautifulSoup(f.read(), features='lxml')
>>>>>>
>>>>>> len(soup.find_all('table')) > 0
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best,
>>>>>> Phillip Cloud
>>>>>>
>>>>>>
>>>>>> On Sun, Jun 2, 2013 at 10:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>>
>>>>>>> On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>>>>> > yeah that's better than dumping it altogether. you can use a strict
>>>>>>> > parser that doesn't try to recover broken html. btw what tests are
>>>>>>> > breaking? i can't get any of them to break...
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Best,
>>>>>>> > Phillip Cloud
>>>>>>> >
>>>>>>> >
>>>>>>> > On Sun, Jun 2, 2013 at 9:47 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>>>>> >> > That is strange. Can you give me the gist of what the traceback
>>>>>>> >> > is? I'm using the same setup, except my libxml2 is 2.9.1, but that
>>>>>>> >> > shouldn't matter. I vote to get rid of the lxml functionality since
>>>>>>> >> > it's not going to parse invalid html, which is what most of the web
>>>>>>> >> > consists of.
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > --
>>>>>>> >> > Best,
>>>>>>> >> > Phillip Cloud
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> > On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>>> >> >>
>>>>>>> >> >> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>>>>> >> >> > This is the reply I got from the lxml people about an "incorrect"
>>>>>>> >> >> > parse of the failed bank list page. It wasn't actually an incorrect
>>>>>>> >> >> > parse; the page has invalid markup and lxml makes no promises about
>>>>>>> >> >> > that. Moral of the story: only use html5lib when parsing HTML
>>>>>>> >> >> > tables. Should I reopen the lxml functionality then, with a big
>>>>>>> >> >> > honking error in the documentation telling users to tidy up the
>>>>>>> >> >> > HTML they want to parse if they want to use lxml, or just scrap
>>>>>>> >> >> > the lxml functionality entirely? No need to clutter up the
>>>>>>> >> >> > codebase.
>>>>>>> >> >> >
>>>>>>> >> >> > --
>>>>>>> >> >> > Best,
>>>>>>> >> >> > Phillip Cloud
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> > ---------- Forwarded message ----------
>>>>>>> >> >> > From: scoder <1181905 at bugs.launchpad.net>
>>>>>>> >> >> > Date: Sun, Jun 2, 2013 at 2:14 AM
>>>>>>> >> >> > Subject: [Bug 1181905] Re: tr elements are not parsed correctly
>>>>>>> >> >> > To: cpcloud at gmail.com
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> > The HTML page doesn't validate; even my browser shows me an HTML
>>>>>>> >> >> > error. The <td> tag you are looking for is not inside of a <tr>
>>>>>>> >> >> > tag, so it's actually correct that the last two tests in your
>>>>>>> >> >> > script fail because they are looking for something that's not
>>>>>>> >> >> > there.
>>>>>>> >> >> >
>>>>>>> >> >> > If you think that the parser in libxml2 should be able to fix this
>>>>>>> >> >> > HTML error automatically, rather than just parsing through it,
>>>>>>> >> >> > please file a bug report for the libxml2 project. Alternatively,
>>>>>>> >> >> > adapt your script to the broken HTML or use an HTML tidying tool
>>>>>>> >> >> > to fix the markup.
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >> > ** Changed in: lxml
>>>>>>> >> >> >        Status: New => Invalid
>>>>>>> >> >> >
>>>>>>> >> >> > --
>>>>>>> >> >> > You received this bug notification because you are subscribed
>>>>>>> >> >> > to the bug report.
>>>>>>> >> >> > https://bugs.launchpad.net/bugs/1181905
>>>>>>> >> >> >
>>>>>>> >> >> > Title:
>>>>>>> >> >> >   tr elements are not parsed correctly
>>>>>>> >> >> >
>>>>>>> >> >> > Status in lxml - the Python XML toolkit:
>>>>>>> >> >> >   Invalid
>>>>>>> >> >> >
>>>>>>> >> >> > Bug description:
>>>>>>> >> >> >   Python              : sys.version_info(major=2, minor=7,
>>>>>>> >> >> >                         micro=5, releaselevel='final', serial=0)
>>>>>>> >> >> >   lxml.etree          : (3, 2, 1, 0)
>>>>>>> >> >> >   libxml used         : (2, 9, 1)
>>>>>>> >> >> >   libxml compiled     : (2, 9, 1)
>>>>>>> >> >> >   libxslt used        : (1, 1, 28)
>>>>>>> >> >> >   libxslt compiled    : (1, 1, 28)
>>>>>>> >> >> >
>>>>>>> >> >> >   See the attached script. The url
>>>>>>> >> >> >   http://www.fdic.gov/bank/individual/failed/banklist.html is not
>>>>>>> >> >> >   parsed correctly by lxml. the element containing 'Gold Canyon'
>>>>>>> >> >> >   is just left out, while all of the other elements seem to be
>>>>>>> >> >> >   there.
>>>>>>> >> >> >
>>>>>>> >> >> > To manage notifications about this bug go to:
>>>>>>> >> >> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >>
>>>>>>> >> >> Test suite fails with bs4 4.2.1 and latest lxml with libxml2
>>>>>>> >> >> 2.9.0. Wasted a lot of time already on this today so the release
>>>>>>> >> >> candidate is going to have to wait until this is sorted out and
>>>>>>> >> >> passing cleanly.
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >>
>>>>>>> >> Perhaps it should attempt lxml and fall back on BS? When lxml
>>>>>>> >> succeeds it is much faster.
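>>>>>>> >>
>>>>>>> >> i.e., something like this (a sketch; the helper names are made up):
>>>>>>> >>
>>>>>>> >> from lxml.etree import XMLSyntaxError
>>>>>>> >>
>>>>>>> >> def _parse(io):
>>>>>>> >>     try:
>>>>>>> >>         return _parse_lxml(io)        # fast path for valid markup
>>>>>>> >>     except XMLSyntaxError:
>>>>>>> >>         return _parse_html5lib(io)    # lenient fallback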
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>> https://gist.github.com/wesm/5695768
>>>>>>>
>>>>>>> In [3]: import lxml.etree as etree
>>>>>>>
>>>>>>> In [4]: etree.__version__
>>>>>>> Out[4]: u'3.2.1'
>>>>>>>
>>>>>>> libxml2 version 2.9.0. i can upgrade if you think it might be that
>>>>>>>
>>>>>>> In [5]: import bs4
>>>>>>>
>>>>>>> In [6]: bs4.__version__
>>>>>>> Out[6]: '4.2.1'
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>