HTML extraction

Wed Dec 8 05:00:26 EST 2021

Roland Mueller <roland.em0001 at googlemail.com> writes:

> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML as the
> following fragment?
>
> <p>A
> <p>B
> <hr>
>

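bs4 is a general HTML/XML parsing library, not something SOAP-specific, and it
copes with this fragment. lxml's XML parser, which is what etree.fromstring
uses by default, insists on well-formed XML; lxml's separate HTML parser
(lxml.html, or etree.HTMLParser) is more forgiving.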

Jupyter console 6.4.0

Python 3.9.9 (main, Nov 16 2021, 07:21:43) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from bs4 import BeautifulSoup as bs

In [2]: soup = bs('<p>A<p>B<hr>')

In [3]: soup.p
Out[3]: <p>A</p>

In [4]: soup.find_all('p')
Out[4]: [<p>A</p>, <p>B</p>]

In [5]: from lxml import etree

In [6]: root = etree.fromstring('<p>A<p>B<hr>')
Traceback (most recent call last):

  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "/var/folders/2l/pdng2d2x18d00m41l6r2ccjr0000gn/T/ipykernel_96220/3376613260.py", line 1, in <module>
    root = etree.fromstring('<p>A<p>B<hr>')

  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring

  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument

  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc

  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

  File "<string>", line 1
XMLSyntaxError: Premature end of data in tag hr line 1, line 1, column 13
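
For comparison, here is a minimal sketch of the same fragment going through
lxml's HTML parser instead of the XML one. It assumes lxml is installed; the
variable names (fragment, root, paragraphs) are only for illustration and are
not from the session above.

```python
from lxml import etree

# The same non-XML fragment as in the session above.
fragment = '<p>A<p>B<hr>'

# Passing etree.HTMLParser() selects libxml2's lenient HTML parser
# instead of the strict XML parser that etree.fromstring uses by default.
root = etree.fromstring(fragment, etree.HTMLParser())

# The HTML parser closes each <p> implicitly when the next <p> or <hr> starts,
# so both paragraphs show up as separate elements.
paragraphs = [p.text for p in root.iter('p')]
print(paragraphs)  # ['A', 'B']
```

The parser wraps the fragment in html and body elements, so root is the html
element here rather than the first p.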
-- 
Pieter van Oostrum <pieter at vanoostrum.org>
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]