[Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

Mon Jun 3 03:19:41 CEST 2013

On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
> This is the reply I got from the lxml people about an "incorrect" parse of
> the failed bank list page. It wasn't actually an incorrect parse: the page
> has invalid markup, and lxml makes no promises about that. Moral of the
> story: only use html5lib when parsing HTML tables. Should I reopen the lxml
> functionality, with a big honking warning in the documentation telling
> users to tidy up their HTML before parsing it with lxml, or should I
> just scrap the lxml functionality entirely? No need to clutter up the
> codebase.
>
> --
> Best,
> Phillip Cloud
>
>
> ---------- Forwarded message ----------
> From: scoder <1181905 at bugs.launchpad.net>
> Date: Sun, Jun 2, 2013 at 2:14 AM
> Subject: [Bug 1181905] Re: tr elements are not parsed correctly
> To: cpcloud at gmail.com
>
>
> The HTML page doesn't validate; even my browser shows me an HTML error.
> The <td> tag you are looking for is not inside of a <tr> tag, so it's
> actually correct that the last two tests in your script fail, because
> they are looking for something that's not there.
>
> If you think that the parser in libxml2 should be able to fix this HTML
> error automatically, rather than just parsing through it, please file a
> bug report for the libxml2 project. Alternatively, adapt your script to
> the broken HTML or use an HTML tidying tool to fix the markup.
>
>
> ** Changed in: lxml
>        Status: New => Invalid
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1181905
>
> Title:
>   tr elements are not parsed correctly
>
> Status in lxml - the Python XML toolkit:
>   Invalid
>
> Bug description:
>   Python              : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
>   lxml.etree          : (3, 2, 1, 0)
>   libxml used         : (2, 9, 1)
>   libxml compiled     : (2, 9, 1)
>   libxslt used        : (1, 1, 28)
>   libxslt compiled    : (1, 1, 28)
>
>   See the attached script. The URL
>   http://www.fdic.gov/bank/individual/failed/banklist.html is not parsed
>   correctly by lxml. The element containing 'Gold Canyon' is just left
>   out, while all of the other elements seem to be there.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions

The test suite fails with bs4 4.2.1 and the latest lxml with libxml2 2.9.0.
I've already wasted a lot of time on this today, so the release candidate
will have to wait until this is sorted out and the tests pass cleanly.
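
Since the lxml reply above comes down to a <td> that is not inside a <tr>,
one way to confirm that kind of markup error before blaming a parser is to
scan the page for cells that open while no row is open. The following is only
a rough sketch using the standard library's html.parser. The names
StrayCellFinder and find_stray_cells are made up for illustration, the sample
markup is invented rather than taken from the FDIC page, and it assumes
tables are not nested.

```python
from html.parser import HTMLParser


class StrayCellFinder(HTMLParser):
    """Record <td>/<th> start tags that appear while no <tr> is open.

    Hypothetical helper, not part of pandas, lxml, or bs4. It assumes
    tables are not nested and that a row ends at </tr> or </table>.
    """

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.stray = []  # (line, column, tag) for each cell outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
        elif tag in ("td", "th") and not self.in_row:
            line, col = self.getpos()
            self.stray.append((line, col, tag))

    def handle_endtag(self, tag):
        if tag in ("tr", "table"):
            self.in_row = False


def find_stray_cells(html_text):
    """Return a list of (line, column, tag) for cells found outside any row."""
    finder = StrayCellFinder()
    finder.feed(html_text)
    finder.close()
    return finder.stray


# Invented markup in the spirit of the bug report: the second cell has no <tr>.
sample = """<table>
<tr><td>First Bank</td></tr>
<td>Gold Canyon</td>
</table>"""

stray_cells = find_stray_cells(sample)
print(stray_cells)
```

An empty result does not prove the page is valid HTML; it only rules out this
one class of error, so a real validator or tidying tool is still the better
check for anything beyond a quick diagnosis.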