[Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

Phillip Cloud cpcloud at gmail.com
Mon Jun 3 03:31:52 CEST 2013


 That is strange. Can you give me the gist of the traceback? I'm using the
same versions, except that my libxml2 is 2.9.1, but that shouldn't matter.
I vote to get rid of the lxml functionality, since it's not going to parse
invalid HTML, which is what most of the web consists of.
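
To make the failure mode concrete, here is a minimal sketch, using only the
standard library, of a check for the specific problem scoder describes below
(a <td> that is not inside a <tr>). The class name, function name, and sample
markup are made up for illustration; the sample is not the actual FDIC page,
and the check assumes rows are closed with explicit </tr> tags.

```python
from html.parser import HTMLParser


class OrphanCellFinder(HTMLParser):
    """Collect the text of <td> cells opened while no <tr> is open.

    Hypothetical helper for illustration only; it assumes every row is
    closed with an explicit </tr>, which real pages do not guarantee.
    """

    def __init__(self):
        super().__init__()
        self.open_rows = 0       # number of currently open <tr> elements
        self.in_orphan = False   # True while inside a <td> with no open <tr>
        self.orphans = []        # accumulated text of each orphan cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.open_rows += 1
        elif tag == "td":
            self.in_orphan = self.open_rows == 0
            if self.in_orphan:
                self.orphans.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self.open_rows > 0:
            self.open_rows -= 1
        elif tag == "td":
            self.in_orphan = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_orphan:
            self.orphans[-1] += data


def find_orphan_cells(html):
    """Return the stripped text of every <td> found outside a <tr>."""
    finder = OrphanCellFinder()
    finder.feed(html)
    finder.close()
    return [text.strip() for text in finder.orphans]


# Made-up markup with one cell that sits directly under <table>.
SAMPLE = """
<table>
  <tr><td>First Bank</td></tr>
  <td>Gold Canyon</td>
  <tr><td>Second Bank</td></tr>
</table>
"""

print(find_orphan_cells(SAMPLE))
```

A non-empty result would suggest the page needs tidying (or a more forgiving
parser such as html5lib) before row-based queries can be expected to find
every cell.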


--
Best,
Phillip Cloud


On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
> > This is the reply I got from the lxml people about an "incorrect" parse
> > of the failed bank list page. It wasn't actually an incorrect parse: the
> > page has invalid markup and lxml makes no promises about that. Moral of
> > the story: only use html5lib when parsing HTML tables. Should I reopen
> > the lxml functionality, with a big honking warning in the documentation
> > telling users to tidy up the HTML they want to parse if they want to use
> > lxml, or just scrap the lxml functionality entirely? No need to clutter
> > up the codebase.
> >
> > --
> > Best,
> > Phillip Cloud
> >
> >
> > ---------- Forwarded message ----------
> > From: scoder <1181905 at bugs.launchpad.net>
> > Date: Sun, Jun 2, 2013 at 2:14 AM
> > Subject: [Bug 1181905] Re: tr elements are not parsed correctly
> > To: cpcloud at gmail.com
> >
> >
> > The HTML page doesn't validate, even my browser shows me an HTML error.
> > The <td> tag you are looking for is not inside of a <tr> tag, so it's
> > actually correct that the last two tests in your script fail because
> > they are looking for something that's not there.
> >
> > If you think that the parser in libxml2 should be able to fix this HTML
> > error automatically, rather than just parsing through it, please file a
> > bug report for the libxml2 project. Alternatively, adapt your script to
> > the broken HTML or use an HTML tidying tool to fix the markup.
> >
> >
> > ** Changed in: lxml
> >        Status: New => Invalid
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1181905
> >
> > Title:
> >   tr elements are not parsed correctly
> >
> > Status in lxml - the Python XML toolkit:
> >   Invalid
> >
> > Bug description:
> >   Python              : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
> >   lxml.etree          : (3, 2, 1, 0)
> >   libxml used         : (2, 9, 1)
> >   libxml compiled     : (2, 9, 1)
> >   libxslt used        : (1, 1, 28)
> >   libxslt compiled    : (1, 1, 28)
> >
> >   See the attached script. The url
> >   http://www.fdic.gov/bank/individual/failed/banklist.html is not parsed
> >   correctly by lxml. The element containing 'Gold Canyon' is just left
> >   out, while all of the other elements seem to be there.
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions
> >
> >
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > http://mail.python.org/mailman/listinfo/pandas-dev
> >
>
> Test suite fails with bs4 4.2.1 and latest lxml with libxml2 2.9.0.
> Wasted a lot of time already on this today so the release candidate is
> going to have to wait until this is sorted out and passing cleanly.
>