Beautiful Soup - close tags more promptly?

Peter J. Holzer hjp-python at hjp.at
Mon Oct 24 06:32:11 EDT 2022


On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
> 
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <OL>
> <LI>'THERE sinks the nebulous star we call the Sun,
> <LI>If that hypothesis of theirs be sound,'
[...]
> <LI>Stirring a sudden transport rose and fell.
> </OL>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
> 
> 
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>.

Ron has already noted that the lxml and html5lib parsers do the right
thing, so just for the record:

The HTML fragment above is valid HTML and contains a number of li
elements at the same level directly below the ol element, not lots of
nested li elements. The end tag of the li element is optional (except in
XHTML), and an li element can't be a direct child of another li, so each
new <LI> start tag implicitly closes the previous li.
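
To illustrate the implicit-end-tag rule, here is a minimal sketch using
only the standard library's html.parser.HTMLParser. It is not what
Beautiful Soup or any of its parsers does internally; the class name
ListItemCollector and the shortened sample input are made up for this
example, and it deliberately ignores nested lists.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Hypothetical helper: collect the text of each <li>, treating a new
    <li> or the end of the list as an implicit </li>.
    Simplification: nested lists are not handled."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished list items, in document order
        self._current = None  # text chunks of the open <li>, or None

    def _close_item(self):
        # Finish the currently open item, if there is one.
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <LI> arrives as "li".
        if tag == "li":
            self._close_item()   # a new <li> implies </li> for the previous one
            self._current = []

    def handle_endtag(self, tag):
        if tag in ("li", "ol", "ul"):
            self._close_item()   # explicit </li>, or the end of the parent list

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


collector = ListItemCollector()
collector.feed("<OL>\n<LI>first line,\n<LI>second line\n<LI>third line.\n</OL>\n")
collector.close()
print(collector.items)
```

With this reading, the three <LI> start tags yield three sibling items
rather than one item nested inside the next.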

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"