[Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

Mon Jun 3 03:47:01 CEST 2013

On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
> That is strange. Can you give me the gist of what the traceback is? I'm
> using the same setup, except my libxml2 is 2.9.1, but that shouldn't matter.
> I vote to get rid of the lxml functionality since it's not going to parse
> invalid HTML, which is what most of the web consists of.
>
>
> --
> Best,
> Phillip Cloud
>
>
> On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>> > This is the reply I got from the lxml people about an "incorrect" parse of
>> > the failed bank list page. It wasn't actually an incorrect parse, the page
>> > has invalid markup and lxml makes no promises about that. Moral of the
>> > story: only use html5lib when parsing HTML tables. Should I reopen the lxml
>> > functionality then, with a big honking error in the documentation telling
>> > users to tidy up the HTML they want to parse if they want to use lxml or
>> > just scrap the lxml functionality entirely? No need to clutter up the
>> > codebase.
>> >
>> > --
>> > Best,
>> > Phillip Cloud
>> >
>> >
>> > ---------- Forwarded message ----------
>> > From: scoder <1181905 at bugs.launchpad.net>
>> > Date: Sun, Jun 2, 2013 at 2:14 AM
>> > Subject: [Bug 1181905] Re: tr elements are not parsed correctly
>> > To: cpcloud at gmail.com
>> >
>> >
>> > The HTML page doesn't validate, even my browser shows me an HTML error.
>> > The <td> tag you are looking for is not inside of a <tr> tag, so it's
>> > actually correct that the last two tests in your script fail because
>> > they are looking for something that's not there.
>> >
>> > If you think that the parser in libxml2 should be able to fix this HTML
>> > error automatically, rather than just parsing through it, please file a
>> > bug report for the libxml2 project. Alternatively, adapt your script to
>> > the broken HTML or use an HTML tidying tool to fix the markup.
>> >
>> >
>> > ** Changed in: lxml
>> >        Status: New => Invalid
>> >
>> > --
>> > You received this bug notification because you are subscribed to the bug
>> > report.
>> > https://bugs.launchpad.net/bugs/1181905
>> >
>> > Title:
>> >   tr elements are not parsed correctly
>> >
>> > Status in lxml - the Python XML toolkit:
>> >   Invalid
>> >
>> > Bug description:
>> >   Python              : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
>> >   lxml.etree          : (3, 2, 1, 0)
>> >   libxml used         : (2, 9, 1)
>> >   libxml compiled     : (2, 9, 1)
>> >   libxslt used        : (1, 1, 28)
>> >   libxslt compiled    : (1, 1, 28)
>> >
>> >   See the attached script. The URL
>> >   http://www.fdic.gov/bank/individual/failed/banklist.html is not parsed
>> >   correctly by lxml. The element containing 'Gold Canyon' is just left
>> >   out, while all of the other elements seem to be there.
>> >
>> > To manage notifications about this bug go to:
>> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions
>> >
>> >
>>
>> Test suite fails with bs4 4.2.1 and latest lxml with libxml2 2.9.0.
>> Wasted a lot of time already on this today so the release candidate is
>> going to have to wait until this is sorted out and passing cleanly.
>
>

Perhaps it should attempt lxml first and fall back on BeautifulSoup (bs4)?
When lxml succeeds, it is much faster.
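
A rough sketch of what "try the fast parser, then fall back" could look like.
This is not pandas code. To keep it runnable with only the standard library,
xml.etree.ElementTree stands in for the fast, strict parser (the lxml role)
and html.parser stands in for the slow, forgiving one (the bs4 role). The
names strict_rows, lenient_rows and read_rows are made up for illustration.
One caveat from the bug report above: lxml did not raise on the FDIC page, it
just left a row out, so falling back on exceptions alone would not have caught
that case. The sketch also treats an empty result as a failure, but a real
check would need to be stronger than that.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


def strict_rows(markup):
    """Strict stand-in (hypothetical): raises ET.ParseError on malformed markup."""
    root = ET.fromstring(markup)
    return [
        [(cell.text or "").strip() for cell in tr if cell.tag in ("td", "th")]
        for tr in root.iter("tr")
    ]


class _LenientRows(HTMLParser):
    """Forgiving stand-in (hypothetical): tolerates missing close tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th"):
            if not self.rows:
                # A cell outside any <tr>: start an implicit row for it.
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def lenient_rows(markup):
    parser = _LenientRows()
    parser.feed(markup)
    parser.close()
    return parser.rows


def read_rows(markup, parsers=(strict_rows, lenient_rows)):
    """Try each parser in order; return (rows, parser_name) for the first
    one that neither raises nor comes back empty."""
    errors = []
    for parse in parsers:
        try:
            rows = parse(markup)
        except Exception as exc:
            errors.append((parse.__name__, exc))
            continue
        if rows:
            return rows, parse.__name__
        errors.append((parse.__name__, ValueError("no rows found")))
    raise ValueError("all parsers failed: %r" % (errors,))
```

With well-formed markup the strict parser answers and the lenient one never
runs. With unclosed <td>/<tr> tags the strict parser raises and the lenient
one picks up the work. Returning the parser name alongside the rows makes it
easy for a test suite to see which path was taken.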