Beautiful Soup - close tags more promptly?

Chris Angelico rosuav at gmail.com
Mon Oct 24 06:56:13 EDT 2022


On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python at hjp.at> wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
> so just for the record:
>
> The HTML fragment above is well-formed and contains a number of li
> elements at the same level directly below the ol element, not lots of
> nested li elements. The end tag of the li element is optional (except in
> XHTML) and li elements don't nest.

That's correct. However, parsing it with html.parser and then
reconstituting it as shown in the example code results in all the
</li> tags coming up right before the </ol>, indicating that the <li>
tags were parsed as deeply nested rather than as siblings.
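
A small standard-library sketch of why this can happen: html.parser.HTMLParser reports a start or end tag only where one is actually written in the source, so the omitted </li> tags never show up as events, and whatever builds a tree on top has to decide for itself where each li ends. The names EventLogger and tag_events and the sample fragment are made up for illustration; the fragment is not the one from the original post.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the tag events that html.parser emits (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(fragment):
    """Return the list of (kind, tag) events for an HTML fragment."""
    logger = EventLogger()
    logger.feed(fragment)
    logger.close()
    return logger.events


# Illustrative fragment with the optional </li> end tags omitted.
events = tag_events("<ol><li>one<li>two<li>three</ol>")
print(events)
```

The event list contains three li start tags and no li end tag at all, which is consistent with a naive tree builder stacking each li inside the previous one.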

In order to get a successful parse out of this, I need something which
sees them as siblings, which html5lib seems to be doing fine. Whether
it has other issues, I don't know, but I guess I'll find out. It's
currently running on the live site and taking several hours, due to
network delays and the server being slow (so I don't really want to
parallelize and overload the thing).
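
As a rough illustration of what "sees them as siblings" means, the sketch below re-emits a fragment and closes any open li before the next li, or before the end tag of the enclosing list. It is a standard-library toy under stated assumptions, not how html5lib or Beautiful Soup work internally: SiblingLiParser and close_li_tags are hypothetical names, and comments, doctypes, and every other optional end tag that a real HTML5 parser handles are ignored.

```python
from html.parser import HTMLParser


class SiblingLiParser(HTMLParser):
    """Re-emit HTML, closing an open <li> before the next <li> or list end."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_li = []  # one flag per open list: is an <li> still open?

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.open_li.append(False)
        elif tag == "li" and self.open_li:
            if self.open_li[-1]:
                self.out.append("</li>")  # implied end of the previous item
            self.open_li[-1] = True
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "li" and self.open_li:
            self.open_li[-1] = False  # explicitly closed in the source
        elif tag in ("ol", "ul") and self.open_li:
            if self.open_li.pop():
                self.out.append("</li>")  # implied end of the last item
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def close_li_tags(fragment):
    """Return the fragment with the omitted </li> end tags made explicit."""
    parser = SiblingLiParser()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


print(close_li_tags("<ol><li>one<li>two<li>three</ol>"))
# prints <ol><li>one</li><li>two</li><li>three</li></ol>
```

For anything beyond a quick experiment, a parser that implements the HTML5 rules (html5lib, as used above) is the safer choice, since li is only one of many elements with an optional end tag.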

ChrisA

