[Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

Phillip Cloud cpcloud at gmail.com
Mon Jun 3 07:52:42 CEST 2013


ok i spent another 2 hours on this out of curiosity and frustration and
because i hate magic like this.

i tried all of these outside of anaconda
it's not the libxml2 version, i tried 2.8.0, 2.9.0, and 2.9.1
it's not the bs4 version, i tried 4.2.0 and 4.2.1
it's not the lxml version, i tried 3.2.0 and 3.2.1

the only time lxml + bs4 breaks is in anaconda + bs4 + lxml 3.2.0

there's an issue with the markup too, i'll update it but again there's no
way to control the validity of other people's markup. the failed bank list
and the python xy plugins tables are both invalid pages so there are no
promises for lxml. i will also make the change to allow users the choice of
whichever they want to use, but i really think if lxml raises an
XMLSyntaxError then pandas should NOT try to use html5lib, the user should
be made aware of what they are doing, namely that the page they are trying
to parse is invalid and that they should explicitly pass flavor='html5lib'
if they want to parse the page. they would have to install html5lib anyway
to get the former behavior.
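to make the explicit-choice idea concrete, here's a rough sketch (not the actual pandas internals; `read_html`'s `flavor` parameter is real, but the table markup is made up for illustration):

```python
# rough sketch: the user picks the parser explicitly instead of pandas
# silently falling back from lxml to html5lib. requires lxml installed.
from io import StringIO

import pandas as pd

# made-up snippet of valid markup in the shape of the failed bank list table
html = StringIO("""
<table>
  <tr><td>First Capital Bank</td><td>Gold Canyon</td></tr>
  <tr><td>Another Bank</td><td>Somewhere</td></tr>
</table>
""")

# strict + fast parser first; on invalid pages the user would have to
# explicitly pass flavor='html5lib' rather than get a silent fallback
tables = pd.read_html(html, flavor="lxml")
print(len(tables))
```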

since most of the web is crap html i really think there's only a minor
benefit to including a fast parser: most of the time it will just be unable
to parse a page, so all it will be fast at is determining that it cannot
parse the page. i don't know for sure, but i doubt there are many huge html
tables out there that live in valid html. anyway, users can use html5lib +
bs4 themselves to clean the markup and then parse that with lxml if they are
going to store it, but even that's of questionable value since you can put
the data in a format that is easier to parse as soon as it's in the frame.
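for reference, the clean-then-parse workaround looks roughly like this (the broken snippet is made up; it mimics the failed bank list problem of a td outside any tr, and it needs bs4 + html5lib + lxml installed):

```python
# sketch of the workaround: let html5lib (via bs4) repair the markup,
# then hand the cleaned serialization to lxml.
from io import StringIO

from bs4 import BeautifulSoup
from lxml import etree

broken = "<table><td>Gold Canyon</td></table>"  # <td> not inside a <tr>

# html5lib follows the html5 tree-construction rules, so it inserts the
# implied <tbody>/<tr> instead of dropping the stray cell
soup = BeautifulSoup(broken, features="html5lib")
cleaned = str(soup)

tree = etree.parse(StringIO(cleaned), etree.HTMLParser())
cells = tree.xpath("//tr/td/text()")
print(cells)
```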

wes, i know you have the ultimate say and of course i will go along with
whatever you think is best for pandas, just wanted to give my 2c. i'm happy
to hear other opinions as well
--
Best,
Phillip Cloud


On Mon, Jun 3, 2013 at 12:26 AM, Phillip Cloud <cpcloud at gmail.com> wrote:

> alright i've spent 2 hours tracking this down and here are the results
>
> for anaconda lxml 3.2.1 works but 3.2.0 doesn't.
> for a regular virtualenv 3.2.0 works fine (so does 3.2.1)
> travis is passing these tests so i think there's something weird with
> anaconda's path stuff
>
> i'm not sure what the issue is there. could be a path issue somewhere, but
> frankly this is not worth spending any more time on.
>
> should i add something to the docs along the lines of if you're using
> anaconda and you want lxml, then use version 3.2.1?
>
> an additional bug sprang up: the tests are not run if lxml is installed
> but not bs4 (they should run in this case); this i will fix and submit a
> pr.
>
> --
> Best,
> Phillip Cloud
>
>
> On Sun, Jun 2, 2013 at 11:54 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>
>> wes there's an issue with your anaconda installation. run
>>
>> # make sure you're using the right conda environment; it tripped me
>> # up the first time
>>
>> pip uninstall lxml
>> pip uninstall beautifulsoup
>> pip uninstall beautifulsoup4
>> pip install lxml
>> pip install beautifulsoup4
>>
>> and try again
>>
>>
>> --
>> Best,
>> Phillip Cloud
>>
>>
>> On Sun, Jun 2, 2013 at 10:37 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>
>>> here's the gist of the working code
>>> https://gist.github.com/cpcloud/5695835
>>>
>>>
>>> --
>>> Best,
>>> Phillip Cloud
>>>
>>>
>>> On Sun, Jun 2, 2013 at 10:34 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>
>>>> Sorry that should be from lxml.html import parse
>>>>
>>>>
>>>> --
>>>> Best,
>>>> Phillip Cloud
>>>>
>>>>
>>>> On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>>>
>>>>> saw that you fixed the first test. the second is correctly failing because
>>>>> the value retrieved is wrong. i replicated your setup sans libxml2 and
>>>>> nothing fails. travis is passing these tests, so i'm not sure exactly what
>>>>> the issue is. can you try the following
>>>>>
>>>>> from lxml import parse
>>>>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
>>>>> doc = parse(url)
>>>>> len(doc.xpath('.//table')) > 0
>>>>>
>>>>> from bs4 import BeautifulSoup
>>>>> from contextlib import closing
>>>>> from urllib2 import urlopen
>>>>> with closing(urlopen(url)) as f:
>>>>>     soup = BeautifulSoup(f.read(), features='lxml')
>>>>>
>>>>> len(soup.find_all('table')) > 0
>>>>>
>>>>>
>>>>> --
>>>>> Best,
>>>>> Phillip Cloud
>>>>>
>>>>>
>>>>> On Sun, Jun 2, 2013 at 10:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>
>>>>>> On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud <cpcloud at gmail.com>
>>>>>> wrote:
>>>>>> > yeah that's better than dumping it altogether. you can use a strict
>>>>>> parser
>>>>>> > that doesn't try to recover broken html. btw what tests are
>>>>>> breaking? i
>>>>>> > can't get any of them to break...
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Best,
>>>>>> > Phillip Cloud
>>>>>> >
>>>>>> >
>>>>>> > On Sun, Jun 2, 2013 at 9:47 PM, Wes McKinney <wesmckinn at gmail.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud <cpcloud at gmail.com>
>>>>>> wrote:
>>>>>> >> >  That is strange. Can you give me the gist of what the traceback
>>>>>> is? I'm
>>>>>> >> > using the same except my lxml is 2.9.1 but that shouldn't
>>>>>> matter. I vote
>>>>>> >> > to
>>>>>> >> > get rid of the lxml functionality since it's not going to parse
>>>>>> invalid
>>>>>> >> > html, which is what most of the web consists of.
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > --
>>>>>> >> > Best,
>>>>>> >> > Phillip Cloud
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney <
>>>>>> wesmckinn at gmail.com>
>>>>>> >> > wrote:
>>>>>> >> >>
>>>>>> >> >> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud <
>>>>>> cpcloud at gmail.com>
>>>>>> >> >> wrote:
>>>>>> >> >> > This is the reply I got from the lxml people about an
>>>>>> "incorrect"
>>>>>> >> >> > parse
>>>>>> >> >> > of
>>>>>> >> >> > the failed bank list page. It wasn't actually an incorrect
>>>>>> parse, the
>>>>>> >> >> > page
>>>>>> >> >> > has invalid markup and lxml makes no promises about that.
>>>>>> Moral of
>>>>>> >> >> > the
>>>>>> >> >> > story: only use html5lib when parsing HTML tables. Should I
>>>>>> reopen
>>>>>> >> >> > the
>>>>>> >> >> > lxml
>>>>>> >> >> > functionality then, with a big honking error in the
>>>>>> documentation
>>>>>> >> >> > telling
>>>>>> >> >> > users to tidy up the HTML they want to parse if they want to
>>>>>> use lxml
>>>>>> >> >> > or
>>>>>> >> >> > just scrap the lxml functionality entirely? No need to
>>>>>> clutter up the
>>>>>> >> >> > codebase.
>>>>>> >> >> >
>>>>>> >> >> > --
>>>>>> >> >> > Best,
>>>>>> >> >> > Phillip Cloud
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > ---------- Forwarded message ----------
>>>>>> >> >> > From: scoder <1181905 at bugs.launchpad.net>
>>>>>> >> >> > Date: Sun, Jun 2, 2013 at 2:14 AM
>>>>>> >> >> > Subject: [Bug 1181905] Re: tr elements are not parsed
>>>>>> correctly
>>>>>> >> >> > To: cpcloud at gmail.com
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > The HTML page doesn't validate, even my browser shows me an
>>>>>> HTML
>>>>>> >> >> > error.
>>>>>> >> >> > The <td> tag you are looking for is not inside of a <tr> tag,
>>>>>> so it's
>>>>>> >> >> > actually correct that the last two tests in your script fail
>>>>>> because
>>>>>> >> >> > they are looking for something that's not there.
>>>>>> >> >> >
>>>>>> >> >> > If you think that the parser in libxml2 should be able to fix
>>>>>> this
>>>>>> >> >> > HTML
>>>>>> >> >> > error automatically, rather than just parsing through it,
>>>>>> please file
>>>>>> >> >> > a
>>>>>> >> >> > bug report for the libxml2 project. Alternatively, adapt your
>>>>>> script
>>>>>> >> >> > to
>>>>>> >> >> > the broken HTML or use an HTML tidying tool to fix the markup.
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > ** Changed in: lxml
>>>>>> >> >> >        Status: New => Invalid
>>>>>> >> >> >
>>>>>> >> >> > --
>>>>>> >> >> > You received this bug notification because you are subscribed
>>>>>> to the
>>>>>> >> >> > bug
>>>>>> >> >> > report.
>>>>>> >> >> > https://bugs.launchpad.net/bugs/1181905
>>>>>> >> >> >
>>>>>> >> >> > Title:
>>>>>> >> >> >   tr elements are not parsed correctly
>>>>>> >> >> >
>>>>>> >> >> > Status in lxml - the Python XML toolkit:
>>>>>> >> >> >   Invalid
>>>>>> >> >> >
>>>>>> >> >> > Bug description:
>>>>>> >> >> >   Python              : sys.version_info(major=2, minor=7,
>>>>>> micro=5,
>>>>>> >> >> > releaselevel='final', serial=0)
>>>>>> >> >> >   lxml.etree          : (3, 2, 1, 0)
>>>>>> >> >> >   libxml used         : (2, 9, 1)
>>>>>> >> >> >   libxml compiled     : (2, 9, 1)
>>>>>> >> >> >   libxslt used        : (1, 1, 28)
>>>>>> >> >> >   libxslt compiled    : (1, 1, 28)
>>>>>> >> >> >
>>>>>> >> >> >   See the attached script. The url
>>>>>> >> >> >   http://www.fdic.gov/bank/individual/failed/banklist.html is not
>>>>>> >> >> > parsed
>>>>>> >> >> >   correctly by lxml. the element containing 'Gold Canyon' is
>>>>>> just
>>>>>> >> >> > left
>>>>>> >> >> >   out, while all of the other elements seem to be there.
>>>>>> >> >> >
>>>>>> >> >> > To manage notifications about this bug go to:
>>>>>> >> >> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > _______________________________________________
>>>>>> >> >> > Pandas-dev mailing list
>>>>>> >> >> > Pandas-dev at python.org
>>>>>> >> >> > http://mail.python.org/mailman/listinfo/pandas-dev
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >> >> Test suite fails with bs4 4.2.1 and latest lxml with libxml2
>>>>>> 2.9.0.
>>>>>> >> >> Wasted a lot of time already on this today so the release
>>>>>> candidate is
>>>>>> >> >> going to have to wait until this is sorted out and passing
>>>>>> cleanly.
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >> Perhaps it should attempt lxml and fall back on BS? When lxml
>>>>>> succeeds
>>>>>> >> it is much faster.
>>>>>> >
>>>>>> >
>>>>>>
>>>>>> https://gist.github.com/wesm/5695768
>>>>>>
>>>>>> In [3]: import lxml.etree as etree
>>>>>>
>>>>>> In [4]: etree.__version__
>>>>>> Out[4]: u'3.2.1'
>>>>>>
>>>>>> libxml2 version 2.9.0. i can upgrade if you think it might be that
>>>>>>
>>>>>> In [5]: import bs4
>>>>>>
>>>>>> In [6]: bs4.__version__
>>>>>> Out[6]: '4.2.1'
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

