[Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly
Wes McKinney
wesmckinn at gmail.com
Mon Jun 3 03:19:41 CEST 2013
On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
> This is the reply I got from the lxml people about an "incorrect" parse of
> the failed bank list page. It wasn't actually an incorrect parse: the page
> has invalid markup, and lxml makes no promises about that. Moral of the
> story: only use html5lib when parsing HTML tables. Should I reopen the lxml
> functionality, with a big honking warning in the documentation telling
> users to tidy up the HTML they want to parse if they want to use lxml, or
> should I just scrap the lxml functionality entirely? No need to clutter up
> the codebase.
>
> --
> Best,
> Phillip Cloud
>
>
> ---------- Forwarded message ----------
> From: scoder <1181905 at bugs.launchpad.net>
> Date: Sun, Jun 2, 2013 at 2:14 AM
> Subject: [Bug 1181905] Re: tr elements are not parsed correctly
> To: cpcloud at gmail.com
>
>
> The HTML page doesn't validate, even my browser shows me an HTML error.
> The <td> tag you are looking for is not inside of a <tr> tag, so it's
> actually correct that the last two tests in your script fail because
> they are looking for something that's not there.
>
> If you think that the parser in libxml2 should be able to fix this HTML
> error automatically, rather than just parsing through it, please file a
> bug report for the libxml2 project. Alternatively, adapt your script to
> the broken HTML or use an HTML tidying tool to fix the markup.
>
>
> ** Changed in: lxml
> Status: New => Invalid
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1181905
>
> Title:
> tr elements are not parsed correctly
>
> Status in lxml - the Python XML toolkit:
> Invalid
>
> Bug description:
> Python : sys.version_info(major=2, minor=7, micro=5,
> releaselevel='final', serial=0)
> lxml.etree : (3, 2, 1, 0)
> libxml used : (2, 9, 1)
> libxml compiled : (2, 9, 1)
> libxslt used : (1, 1, 28)
> libxslt compiled : (1, 1, 28)
>
> See the attached script. The url
> http://www.fdic.gov/bank/individual/failed/banklist.html is not parsed
> correctly by lxml. The element containing 'Gold Canyon' is just left
> out, while all of the other elements seem to be there.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions
>
The test suite fails with bs4 4.2.1 and the latest lxml with libxml2 2.9.0.
I have already wasted a lot of time on this today, so the release candidate
will have to wait until this is sorted out and the tests pass cleanly.
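
For reference, the markup problem described in the quoted lxml reply (a <td> that is not inside any <tr>) can be checked for without lxml or html5lib. Below is a minimal sketch using only the standard library's html.parser. The class and function names and the sample markup are hypothetical, made up for illustration; the sample is not the real FDIC page. It is a naive check: it only counts explicitly opened and closed <tr> tags, so it says nothing about how lxml, libxml2 or html5lib would actually repair the page.

```python
from html.parser import HTMLParser


class OrphanCellFinder(HTMLParser):
    """Collect the text of <td> cells opened while no <tr> is open."""

    def __init__(self):
        super().__init__()
        self.open_rows = 0      # number of currently open <tr> elements
        self.in_orphan = False  # True while inside a <td> with no open <tr>
        self.orphans = []       # text content of each orphaned cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.open_rows += 1
        elif tag == "td":
            # A cell is "orphaned" if it starts outside every row.
            self.in_orphan = self.open_rows == 0
            if self.in_orphan:
                self.orphans.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self.open_rows > 0:
            self.open_rows -= 1
        elif tag == "td":
            self.in_orphan = False

    def handle_data(self, data):
        if self.in_orphan:
            self.orphans[-1] += data


def find_orphan_cells(html):
    """Return the stripped text of every <td> found outside a <tr>."""
    finder = OrphanCellFinder()
    finder.feed(html)
    finder.close()
    return [text.strip() for text in finder.orphans]


# Hypothetical sample: the middle "row" is missing its <tr> wrapper.
sample = """
<table>
  <tr><td>First Bank</td><td>AZ</td></tr>
  <td>Gold Canyon</td><td>AZ</td>
  <tr><td>Last Bank</td><td>CA</td></tr>
</table>
"""

print(find_orphan_cells(sample))
```

Running a check like this over a page before handing it to a strict parser would at least show which cells a query of the form "td inside tr" is going to miss.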