[Pandas-dev] Fwd: [Bug 1181905] Re: tr elements are not parsed correctly

Phillip Cloud cpcloud at gmail.com
Mon Jun 3 05:54:34 CEST 2013


Wes, there's an issue with your Anaconda installation. Run:

# make sure you're using the right conda environment; it tripped me up the first time

pip uninstall lxml
pip uninstall beautifulsoup
pip uninstall beautifulsoup4
pip install lxml
pip install beautifulsoup4

and try again
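If the reinstall goes through, a quick sanity check confirms both parser backends are visible from the active environment (a Python 3 sketch; `backend_available` is a made-up helper, not anything in pandas):

```python
import importlib.util

def backend_available(name):
    """Return True if the top-level package behind `name` is importable."""
    return importlib.util.find_spec(name.split(".")[0]) is not None

# hypothetical post-reinstall check of the two parser backends
for mod in ("lxml.etree", "bs4"):
    print(mod, "ok" if backend_available(mod) else "MISSING")
```

Using `find_spec` only probes the import machinery, so a missing package reports MISSING instead of raising.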


--
Best,
Phillip Cloud


On Sun, Jun 2, 2013 at 10:37 PM, Phillip Cloud <cpcloud at gmail.com> wrote:

> here's the gist of the working code: https://gist.github.com/cpcloud/5695835
>
>
> --
> Best,
> Phillip Cloud
>
>
> On Sun, Jun 2, 2013 at 10:34 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>
>> Sorry that should be from lxml.html import parse
>>
>>
>> --
>> Best,
>> Phillip Cloud
>>
>>
>> On Sun, Jun 2, 2013 at 10:33 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>>
>>> Saw that you fixed the first test. The second is correctly failing because
>>> the value retrieved is wrong. I replicated your setup sans libxml2 and
>>> nothing fails. Travis is passing these tests, so I'm not sure exactly what
>>> the issue is. Can you try the following:
>>>
>>> from lxml import parse
>>> url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
>>> doc = parse(url)
>>> len(doc.xpath('.//table')) > 0
>>>
>>> from bs4 import BeautifulSoup
>>> from contextlib import closing
>>> from urllib2 import urlopen
>>> with closing(urlopen(url)) as f:
>>>     soup = BeautifulSoup(f.read(), features='lxml')
>>>
>>> len(soup.find_all('table')) > 0
>>>
>>>
>>> --
>>> Best,
>>> Phillip Cloud
>>>
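For reference, the nesting flaw under discussion (a <td> with no enclosing <tr>) can be seen with nothing but the standard library; this Python 3 sketch shows a lenient parser streaming the stray tag through rather than rejecting it (illustrative only, not pandas code):

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record start tags in document order. html.parser does not
    validate nesting, so a <td> outside any <tr> still shows up."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

# markup with the same flaw as the FDIC page: a td with no enclosing tr
logger = TagLogger()
logger.feed("<table><td>Gold Canyon</td></table>")
print(logger.tags)  # ['table', 'td']
```

A strict parser would refuse such markup outright, which is why strict and recovering parsers can disagree about what is in the table.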
>>>
>>>> On Sun, Jun 2, 2013 at 10:06 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>
>>>> On Sun, Jun 2, 2013 at 6:57 PM, Phillip Cloud <cpcloud at gmail.com>
>>>> wrote:
>>>> > Yeah, that's better than dumping it altogether. You can use a strict
>>>> > parser that doesn't try to recover broken HTML. BTW, what tests are
>>>> > breaking? I can't get any of them to break...
>>>> >
>>>> >
>>>> > --
>>>> > Best,
>>>> > Phillip Cloud
>>>> >
>>>> >
>>>> > On Sun, Jun 2, 2013 at 9:47 PM, Wes McKinney <wesmckinn at gmail.com>
>>>> wrote:
>>>> >>
>>>> >> On Sun, Jun 2, 2013 at 6:31 PM, Phillip Cloud <cpcloud at gmail.com>
>>>> wrote:
>>>> >> > That is strange. Can you give me the gist of what the traceback is?
>>>> >> > I'm using the same except my lxml is 2.9.1, but that shouldn't
>>>> >> > matter. I vote to get rid of the lxml functionality since it's not
>>>> >> > going to parse invalid HTML, which is what most of the web consists
>>>> >> > of.
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Best,
>>>> >> > Phillip Cloud
>>>> >> >
>>>> >> >
>>>> >> > On Sun, Jun 2, 2013 at 9:19 PM, Wes McKinney <wesmckinn at gmail.com>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> On Sun, Jun 2, 2013 at 3:21 PM, Phillip Cloud <cpcloud at gmail.com>
>>>> >> >> wrote:
>>>> >> >> > This is the reply I got from the lxml people about an "incorrect"
>>>> >> >> > parse of the failed bank list page. It wasn't actually an
>>>> >> >> > incorrect parse: the page has invalid markup, and lxml makes no
>>>> >> >> > promises about that. Moral of the story: only use html5lib when
>>>> >> >> > parsing HTML tables. Should I reopen the lxml functionality then,
>>>> >> >> > with a big honking error in the documentation telling users to
>>>> >> >> > tidy up the HTML they want to parse if they want to use lxml, or
>>>> >> >> > just scrap the lxml functionality entirely? No need to clutter up
>>>> >> >> > the codebase.
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Best,
>>>> >> >> > Phillip Cloud
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > ---------- Forwarded message ----------
>>>> >> >> > From: scoder <1181905 at bugs.launchpad.net>
>>>> >> >> > Date: Sun, Jun 2, 2013 at 2:14 AM
>>>> >> >> > Subject: [Bug 1181905] Re: tr elements are not parsed correctly
>>>> >> >> > To: cpcloud at gmail.com
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > The HTML page doesn't validate; even my browser shows me an HTML
>>>> >> >> > error. The <td> tag you are looking for is not inside of a <tr>
>>>> >> >> > tag, so it's actually correct that the last two tests in your
>>>> >> >> > script fail because they are looking for something that's not
>>>> >> >> > there.
>>>> >> >> >
>>>> >> >> > If you think that the parser in libxml2 should be able to fix
>>>> >> >> > this HTML error automatically, rather than just parsing through
>>>> >> >> > it, please file a bug report for the libxml2 project.
>>>> >> >> > Alternatively, adapt your script to the broken HTML or use an
>>>> >> >> > HTML tidying tool to fix the markup.
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > ** Changed in: lxml
>>>> >> >> >        Status: New => Invalid
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > You received this bug notification because you are subscribed to
>>>> >> >> > the bug report.
>>>> >> >> > https://bugs.launchpad.net/bugs/1181905
>>>> >> >> >
>>>> >> >> > Title:
>>>> >> >> >   tr elements are not parsed correctly
>>>> >> >> >
>>>> >> >> > Status in lxml - the Python XML toolkit:
>>>> >> >> >   Invalid
>>>> >> >> >
>>>> >> >> > Bug description:
>>>> >> >> >   Python              : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
>>>> >> >> >   lxml.etree          : (3, 2, 1, 0)
>>>> >> >> >   libxml used         : (2, 9, 1)
>>>> >> >> >   libxml compiled     : (2, 9, 1)
>>>> >> >> >   libxslt used        : (1, 1, 28)
>>>> >> >> >   libxslt compiled    : (1, 1, 28)
>>>> >> >> >
>>>> >> >> >   See the attached script. The url
>>>> >> >> >   http://www.fdic.gov/bank/individual/failed/banklist.html is not
>>>> >> >> >   parsed correctly by lxml. The element containing 'Gold Canyon'
>>>> >> >> >   is just left out, while all of the other elements seem to be
>>>> >> >> >   there.
>>>> >> >> >
>>>> >> >> > To manage notifications about this bug go to:
>>>> >> >> > https://bugs.launchpad.net/lxml/+bug/1181905/+subscriptions
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > _______________________________________________
>>>> >> >> > Pandas-dev mailing list
>>>> >> >> > Pandas-dev at python.org
>>>> >> >> > http://mail.python.org/mailman/listinfo/pandas-dev
>>>> >> >> >
>>>> >> >>
>>>> >> >> Test suite fails with bs4 4.2.1 and latest lxml with libxml2 2.9.0.
>>>> >> >> Wasted a lot of time already on this today, so the release candidate
>>>> >> >> is going to have to wait until this is sorted out and passing
>>>> >> >> cleanly.
>>>> >> >
>>>> >> >
>>>> >>
>>>> >> Perhaps it should attempt lxml and fall back on BS? When lxml
>>>> >> succeeds, it is much faster.
>>>> >
>>>> >
>>>>
>>>> https://gist.github.com/wesm/5695768
>>>>
>>>> In [3]: import lxml.etree as etree
>>>>
>>>> In [4]: etree.__version__
>>>> Out[4]: u'3.2.1'
>>>>
>>>> libxml2 version 2.9.0. I can upgrade if you think it might be that.
>>>>
>>>> In [5]: import bs4
>>>>
>>>> In [6]: bs4.__version__
>>>> Out[6]: '4.2.1'
>>>>
>>>
>>>
>>
>
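The attempt-lxml-then-fall-back idea from the thread can be sketched as a small helper; everything here (names, stand-in parser callables) is illustrative, not pandas' actual implementation:

```python
def first_successful(parsers, html_text):
    """Try (name, parse_fn) pairs in order and return the first
    non-empty list of tables, skipping parsers that raise; a sketch
    of the lxml-first, BeautifulSoup-fallback strategy."""
    for name, parse_fn in parsers:
        try:
            tables = parse_fn(html_text)
        except Exception:
            continue  # parser missing or choked on invalid markup
        if tables:
            return name, tables
    raise ValueError("no parser could find a table")

# stand-in parser callables for illustration
def strict(text):
    raise ValueError("invalid markup")   # e.g. a strict parser refusing broken HTML

def lenient(text):
    return ["<table>...</table>"]        # e.g. a recovering parser succeeding

name, tables = first_successful([("lxml", strict), ("bs4", lenient)], "<html>")
print(name)  # bs4
```

This keeps the speed of lxml on valid pages while still handling the invalid markup that, as noted above, most of the web consists of.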

