Beautiful Soup - close tags more promptly?

Peter J. Holzer hjp-python at hjp.at
Mon Oct 24 08:21:34 EDT 2022


On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > Ron has already noted that the lxml and html5lib parsers do the right
> > thing, so just for the record:
> >
> > The HTML fragment above is well-formed and contains a number of li
> > elements at the same level directly below the ol element, not lots of
> > nested li elements. The end tag of the li element is optional (except in
> > XHTML) and li elements don't nest.
> 
> That's correct. However, parsing it with html.parser and then
> reconstituting it as shown in the example code results in all the
> </li> tags coming up right before the </ol>, indicating that the <li>
> tags were parsed as deeply nested rather than as siblings.

Yes, I got that. What I wanted to say was that this is indeed a bug in
html.parser and not an error (or sloppiness, as you called it) in the
input or an ambiguity in the HTML standard.
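
To make the "optional end tag" rule concrete, here is a minimal sketch
using only the standard library's event-based HTMLParser class. The names
ListItemDepth and li_depths are hypothetical, invented for this
illustration; this is not how Beautiful Soup, lxml or html5lib are
implemented, and it ignores void elements and most other HTML5 rules. It
only records how deep each <li> sits, once without and once with the rule
that a new <li> closes a still-open one in the same list.

```python
from html.parser import HTMLParser


class ListItemDepth(HTMLParser):
    """Record the nesting depth at which each <li> start tag is seen.

    Hypothetical helper for illustration only; void elements such as
    <br> are not handled.
    """

    def __init__(self, implied_end_tags):
        super().__init__()
        self.implied_end_tags = implied_end_tags
        self.open_tags = []   # stack of currently open elements
        self.li_depths = []   # stack depth at which each <li> opens

    def handle_starttag(self, tag, attrs):
        if self.implied_end_tags and tag == "li":
            # A new <li> implicitly closes an open <li>, unless a nested
            # list (<ol>/<ul>) sits between the two.
            for i in range(len(self.open_tags) - 1, -1, -1):
                if self.open_tags[i] in ("ol", "ul"):
                    break
                if self.open_tags[i] == "li":
                    del self.open_tags[i:]
                    break
        if tag == "li":
            self.li_depths.append(len(self.open_tags))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def li_depths(html, implied_end_tags):
    """Return the depth of every <li> in the given HTML fragment."""
    parser = ListItemDepth(implied_end_tags)
    parser.feed(html)
    parser.close()
    return parser.li_depths


FRAGMENT = "<ol><li>one<li>two<li>three</ol>"
naive = li_depths(FRAGMENT, implied_end_tags=False)
fixed = li_depths(FRAGMENT, implied_end_tags=True)
print(naive)  # [1, 2, 3]: each <li> ends up inside the previous one
print(fixed)  # [1, 1, 1]: all <li> elements are siblings below <ol>
```

The first result corresponds to the deeply nested output Chris described,
the second to the sibling structure the HTML standard calls for.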


> In order to get a successful parse out of this, I need something which
> sees them as siblings,

Right, but Roel (correct name this time) had already posted that lxml
and html5lib parse this correctly, so I saw no need to belabour that
point.

> which html5lib seems to be doing fine. Whether
> it has other issues, I don't know, but I guess I'll find out....

The link somebody posted mentions that it's "very slow". Which may or
may not be a problem when you have to parse 9000 files. But if it does
implement HTML5 correctly, it should parse any file the same as a modern
browser does (maybe excluding quirks mode).

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

