Beautiful Soup - close tags more promptly?

Chris Angelico rosuav at gmail.com
Sun Oct 23 22:29:13 EDT 2022


Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:

from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)


On this small snippet, it works acceptably, but each <li> ends up
nested inside the previous one, so a large number of </li> tags pile
up immediately before the </ol>. On the original file (see link if
you want to try it), this blows right through Python's default
recursion limit, due to the crazy number of "nested" list items.

Is there a way to tell BS4 at parse time that each <li> element ends
at the next <li>, rather than waiting for the final </ol>? This would
produce tidier output, and also eliminate most of the recursion levels.
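
For reference, the behaviour being asked for can be approximated by
preprocessing the markup before it reaches Beautiful Soup. The sketch
below is one possible workaround, not a BS4 feature: LiCloser and
close_list_items are hypothetical names, it uses only the standard
library, and it assumes that an open <li> ends at the next <li> in the
same list or at the enclosing </ol> or </ul>. It is not guaranteed to
round-trip every edge case of messy markup.

```python
from html.parser import HTMLParser


class LiCloser(HTMLParser):
    """Re-emit markup, adding </li> where an open <li> is implicitly ended."""

    def __init__(self):
        # Leave character references alone so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        # One "is an <li> open?" flag per enclosing list, plus a sentinel.
        self.open_li = [False]

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.open_li[-1]:
                self.out.append("</li>")
            self.open_li[-1] = True
        elif tag in ("ol", "ul"):
            self.open_li.append(False)
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> pass through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "li":
            self.open_li[-1] = False
        elif tag in ("ol", "ul"):
            if self.open_li[-1]:
                self.out.append("</li>")
            if len(self.open_li) > 1:
                self.open_li.pop()
            else:
                self.open_li[-1] = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def close_list_items(markup):
    """Return markup (a str) with explicit </li> tags inserted."""
    parser = LiCloser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(close_list_items("<OL>\n<LI>one\n<LI>two\n</OL>"))
```

The result could then be handed to BeautifulSoup(..., "html.parser") in
place of the raw blob; since the blob above is bytes, it would need
decoding first, and the right encoding for the original file is not
something this sketch tries to detect.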

ChrisA

