Beautiful Soup - close tags more promptly?

Chris Angelico rosuav at gmail.com
Mon Oct 24 10:01:19 EDT 2022


On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-python at hjp.at> wrote:
>
> On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > > Ron has already noted that the lxml and html5 parser do the right thing,
> > > so just for the record:
> > >
> > > The HTML fragment above is well-formed and contains a number of li
> > > elements at the same level directly below the ol element, not lots of
> > > nested li elements. The end tag of the li element is optional (except in
> > > XHTML) and li elements don't nest.
> >
> > That's correct. However, parsing it with html.parser and then
> > reconstituting it as shown in the example code results in all the
> > </li> tags coming up right before the </ol>, indicating that the <li>
> > tags were parsed as deeply nested rather than as siblings.
>
> Yes, I got that. What I wanted to say was that this is indeed a bug in
> html.parser and not an error (or sloppiness, as you called it) in the
> input or ambiguity in the HTML standard.

I described the HTML as "sloppy" for a number of reasons, but I was of
the understanding that it's generally recommended to have the closing
tags. Not that it matters much.
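
For anyone who wants to see the nesting effect without installing
anything, here is a minimal sketch using only the stdlib tokenizer. It
is not Beautiful Soup's actual tree builder; the DepthRecorder class,
the li_depths helper, and the sample fragment are made up for
illustration. It tracks open tags on a stack, once closing tags only on
an explicit end tag, and once letting a new <li> close an <li> that is
still open on top of the stack (a simplified version of the HTML rule).

```python
from html.parser import HTMLParser

# Hypothetical sample input: <li> end tags omitted, which HTML permits.
FRAGMENT = "<ol><li>one<li>two<li>three</ol>"


class DepthRecorder(HTMLParser):
    """Record how many tags are open when each <li> start tag arrives."""

    def __init__(self, implicit_li_close):
        super().__init__()
        self.implicit_li_close = implicit_li_close
        self.stack = []       # currently open tag names
        self.li_depths = []   # open-tag count seen by each <li>

    def handle_starttag(self, tag, attrs):
        # Simplified rule: a new <li> closes an <li> sitting on top of
        # the stack. Real HTML5 parsing has more cases than this.
        if self.implicit_li_close and tag == "li":
            if self.stack and self.stack[-1] == "li":
                self.stack.pop()
        if tag == "li":
            self.li_depths.append(len(self.stack))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def li_depths(fragment, implicit_li_close):
    parser = DepthRecorder(implicit_li_close)
    parser.feed(fragment)
    parser.close()
    return parser.li_depths


print(li_depths(FRAGMENT, implicit_li_close=False))
print(li_depths(FRAGMENT, implicit_li_close=True))
```

With explicit-only closing, each <li> lands one level deeper than the
previous one; with the implicit-close rule they all sit directly under
the <ol>, as siblings.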

> > which html5lib seems to be doing fine. Whether
> > it has other issues, I don't know, but I guess I'll find out....
>
> The link somebody posted mentions that it's "very slow". Which may or
> may not be a problem when you have to parse 9000 files. But if it does
> implement HTML5 correctly, it should parse any file the same as a modern
> browser does (maybe excluding quirks mode).
>

Yeah. TBH I think the two-hour run time is dominated by network
delays, not parsing time, but if I had a service where people could
upload HTML to be parsed, parser speed might affect throughput.

For the record, if anyone else is considering html5lib: It is likely
"fast enough", even if not fast. Give it a try.
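
A rough way to check "fast enough" against your own files, as a sketch
only: the time_parse helper and the synthetic documents are
assumptions for illustration, and the stdlib tokenizer is just a
baseline (it builds no tree, so it is not a like-for-like comparison).

```python
import time
from html.parser import HTMLParser


def time_parse(parse, documents):
    """Return total seconds spent calling parse() on each document."""
    start = time.perf_counter()
    for doc in documents:
        parse(doc)
    return time.perf_counter() - start


def stdlib_tokenize(doc):
    # Baseline only: html.parser tokenizes but does not build a tree.
    parser = HTMLParser()
    parser.feed(doc)
    parser.close()


# Synthetic stand-in for real files (made up for this example).
documents = ["<ol>" + "<li>item" * 1000 + "</ol>"] * 20

baseline = time_parse(stdlib_tokenize, documents)
print(f"html.parser tokenizer: {baseline:.3f}s")

# If html5lib is installed, it can be timed the same way:
#     import html5lib
#     print(time_parse(html5lib.parse, documents))
```

Swap in a sample of the real documents to get numbers that mean
something for your workload.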

(And I know what slow parsing feels like. Parsing a ~100MB file with a
decently-fast grammar-based lexer takes a good while. Parsing the same
content after it's been converted to JSON? Fast.)

ChrisA

