Beautiful Soup - close tags more promptly?

Roel Schroeven roel at roelschroeven.net
Mon Oct 24 04:33:00 EDT 2022


(Oops, accidentally only sent to Chris instead of to the list)

On 24/10/2022 at 10:02, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 18:43, Roel Schroeven <roel at roelschroeven.net> wrote:
> > Using html5lib (install package html5lib) instead of html.parser seems
> > to do the trick: it inserts </li> right before the next <li>, and one
> > before the closing </ol> . On my system the same happens when I don't
> > specify a parser, but IIRC that's a bit fragile because other systems
> > can choose different parsers if you don't explicitly specify one.
> >
>
> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
> disadvantages of the different parsers; is there a tabulation
> anywhere, or at least a list of recommendations on choosing a suitable
> parser?
There's a bit of information here: 
https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
Not much but maybe it can be helpful.
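
For a quick side-by-side look, here is a minimal sketch that runs one
small snippet through whichever of the three parsers are installed. The
snippet and the helper name render_with are made up for illustration;
lxml and html5lib are separate packages and are skipped if missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Two list items with the closing </li> tags left out.
SNIPPET = "<ol><li>one<li>two</ol>"


def render_with(parser_name, markup=SNIPPET):
    """Re-serialize markup using one parser; None if that parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser_name))
    except FeatureNotFound:
        return None


# Print each parser's view of the same markup so they can be compared by eye.
for name in ("html.parser", "lxml", "html5lib"):
    print(f"{name:12} {render_with(name)}")
```

Naming the parser explicitly in every call also avoids the "different
default parser on a different system" problem mentioned above.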
> I'm dealing with a HUGE mess of different coding standards, all the
> way from 1990s-level stuff (images for indentation, tables for
> formatting, and <FONT FACE="Wingdings">) up through HTML4 (a good few
> of the pages have at least some <meta> tags and declare their
> encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
> There's even a couple of pages that use frames - yes, the old style
> with a <noframes> block in case the browser can't handle it. I went
> with html.parser on the expectation that it'd give the best "across
> all standards" results, but I'll give html5lib a try and see if it
> does better.
>
> Would rather not try to use different parsers for different files, but
> if necessary, I'll figure something out.
>
> (For reference, this is roughly 9000 HTML files that have to be
> parsed. Doing things by hand is basically not an option.)
>
I'd give lxml a try too. It might also be worth preprocessing the HTML 
with html-tidy (https://www.html-tidy.org/); that might actually do a 
pretty good job of getting rid of all kinds of historical inconsistencies.
Checking whether any given solution works across thousands of input files 
will always be a pain, I'm afraid.
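
One way to make that checking a bit less painful is to automate a crude
sanity test. The sketch below assumes the symptom to look for is an <li>
whose direct parent is another <li> (a properly nested sub-list has a
<ul> or <ol> in between, so it is not counted). The function names are
hypothetical, and this only flags candidates for a closer look.

```python
from bs4 import BeautifulSoup


def nested_li_count(markup, parser_name="html.parser"):
    """Count <li> tags whose direct parent is another <li>."""
    soup = BeautifulSoup(markup, parser_name)
    return sum(1 for li in soup.find_all("li") if li.parent.name == "li")


def suspicious_files(paths, parser_name="html.parser"):
    """Yield (path, count) for each file with at least one directly nested <li>."""
    for path in paths:
        # Pass bytes so Beautiful Soup can work out each file's encoding itself.
        count = nested_li_count(path.read_bytes(), parser_name)
        if count:
            yield path, count
```

It could be driven with something like
suspicious_files(pathlib.Path("pages").rglob("*.html")), where "pages"
stands in for wherever the files live. Running it once per parser shows
which parser leaves the fewest files flagged.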

-- 
"I've come up with a set of rules that describe our reactions to technologies:
1. Anything that is in the world when you’re born is normal and ordinary and is
    just a natural part of the way the world works.
2. Anything that's invented between when you’re fifteen and thirty-five is new
    and exciting and revolutionary and you can probably get a career in it.
3. Anything invented after you're thirty-five is against the natural order of things."
         -- Douglas Adams, The Salmon of Doubt


