Beautiful Soup - close tags more promptly?

Mon Oct 24 04:02:15 EDT 2022

On Mon, 24 Oct 2022 at 18:43, Roel Schroeven <roel at roelschroeven.net> wrote:
>
> Op 24/10/2022 om 4:29 schreef Chris Angelico:
> > Parsing ancient HTML files is something Beautiful Soup is normally
> > great at. But I've run into a small problem, caused by this sort of
> > sloppy HTML:
> >
> > from bs4 import BeautifulSoup
> > # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> > blob = b"""
> > <OL>
> > <LI>'THERE sinks the nebulous star we call the Sun,
> > <LI>If that hypothesis of theirs be sound,'
> > <LI>Said Ida;' let us down and rest:' and we
> > <LI>Down from the lean and wrinkled precipices,
> > <LI>By every coppice-feather'd chasm and cleft,
> > <LI>Dropt thro' the ambrosial gloom to where below
> > <LI>No bigger than a glow-worm shone the tent
> > <LI>Lamp-lit from the inner. Once she lean'd on me,
> > <LI>Descending; once or twice she lent her hand,
> > <LI>And blissful palpitations in the blood,
> > <LI>Stirring a sudden transport rose and fell.
> > </OL>
> > """
> > soup = BeautifulSoup(blob, "html.parser")
> > print(soup)
> >
> >
> > On this small snippet, it works acceptably, but puts a large number of
> > </li> tags immediately before the </ol>. On the original file (see
> > link if you want to try it), this blows right through the default
> > recursion limit, due to the crazy number of "nested" list items.
> >
> > Is there a way to tell BS4 on parse that these <li> elements end at
> > the next <li>, rather than waiting for the final </ol>? This would
> > make tidier output, and also eliminate most of the recursion levels.
> >
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and one
> before the closing </ol>. On my system the same happens when I don't
> specify a parser, but IIRC that's a bit fragile because other systems
> can choose different parsers if you don't explicitly specify one.
>

Ah, cool. Thanks. I'm not entirely sure of the various advantages and
disadvantages of the different parsers; is there a tabulation
anywhere, or at least a list of recommendations on choosing a suitable
parser?

I'm dealing with a HUGE mess of different coding standards, all the
way from 1990s-level stuff (images for indentation, tables for
formatting, and <FONT FACE="Wingdings">) up through HTML4 (a good few
of the pages have at least some <meta> tags and declare their
encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
There's even a couple of pages that use frames - yes, the old style
with a <noframes> block in case the browser can't handle it. I went
with html.parser on the expectation that it'd give the best "across
all standards" results, but I'll give html5lib a try and see if it
does better.
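
For anyone wanting to repeat the comparison, here's a minimal sketch of
how it could be checked on a cut-down version of the snippet above. The
helper name direct_li_count is made up for illustration, and lxml is
included only as a third candidate that may or may not be installed;
any parser that is missing is reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A shortened version of the unclosed-<LI> markup quoted above.
BLOB = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
</OL>
"""

def direct_li_count(markup, parser):
    """Count the <li> elements that are direct children of the first <ol>.

    Returns None if the requested parser is not installed.
    (Hypothetical helper, not part of Beautiful Soup.)
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    # recursive=False looks only at immediate children, so nested
    # <li> elements are not counted.
    return len(soup.ol.find_all("li", recursive=False))

for parser in ("html.parser", "html5lib", "lxml"):
    print(parser, direct_li_count(BLOB, parser))
```

Going by the behaviour described above, html.parser should report a
single direct child (with the rest nested inside it), while html5lib
should report one per line of the poem.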

Would rather not try to use different parsers for different files, but
if necessary, I'll figure something out.

(For reference, this is roughly 9000 HTML files that have to be
parsed. Doing things by hand is basically not an option.)

ChrisA