Beautiful Soup - close tags more promptly?

Roel Schroeven roel at roelschroeven.net
Mon Oct 24 03:42:13 EDT 2022


On 24/10/2022 at 4:29, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <OL>
> <LI>'THERE sinks the nebulous star we call the Sun,
> <LI>If that hypothesis of theirs be sound,'
> <LI>Said Ida;' let us down and rest:' and we
> <LI>Down from the lean and wrinkled precipices,
> <LI>By every coppice-feather'd chasm and cleft,
> <LI>Dropt thro' the ambrosial gloom to where below
> <LI>No bigger than a glow-worm shone the tent
> <LI>Lamp-lit from the inner. Once she lean'd on me,
> <LI>Descending; once or twice she lent her hand,
> <LI>And blissful palpitations in the blood,
> <LI>Stirring a sudden transport rose and fell.
> </OL>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
>
>
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>. On the original file (see
> link if you want to try it), this blows right through the default
> recursion limit, due to the crazy number of "nested" list items.
>
> Is there a way to tell BS4 on parse that these <li> elements end at
> the next <li>, rather than waiting for the final </ol>? This would
> make tidier output, and also eliminate most of the recursion levels.
>
Using html5lib (install the html5lib package) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile because other systems
can choose a different parser if you don't explicitly specify one.
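
A minimal sketch of how to check this, assuming both bs4 and html5lib are
installed. The three-item blob is a made-up stand-in for the original
file, and the names items and depths are only for illustration:

```python
from bs4 import BeautifulSoup

# Shortened, made-up stand-in for the sloppy HTML: no closing </LI> tags.
blob = b"""
<OL>
<LI>first line,
<LI>second line,
<LI>third line.
</OL>
"""

# Name the parser explicitly so the result doesn't depend on which
# parsers happen to be installed on a given system.
soup = BeautifulSoup(blob, "html5lib")

items = soup.find_all("li")
# For each <li>, count how many <li> ancestors it has.
depths = [len(li.find_parents("li")) for li in items]
print(depths)
```

If the items end up as siblings, every depth should be 0. Swapping in
"html.parser" should show the depths growing with each item instead.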

-- 
"I love science, and it pains me to think that so many are terrified
of the subject or feel that choosing science means you cannot also
choose compassion, or the arts, or be awed by nature. Science is not
meant to cure us of mystery, but to reinvent and reinvigorate it."
         -- Robert Sapolsky


