Beautiful Soup - close tags more promptly?

Chris Angelico rosuav at gmail.com
Mon Oct 24 15:56:58 EDT 2022


On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer <hjp-python at hjp.at> wrote:
> There may be several reasons:
>
> * Historically, some browsers differed in which end tags were actually
>   optional. Since (AFAIK) no mainstream browser ever implemented a real
>   SGML parser (they were always "tag soup" parsers with lots of ad-hoc
>   rules) this sometimes even changed within the same browser depending
>   on context (e.g. a simple table might work but nested tables wouldn't).
>   So people started to use end-tags defensively.
> * XHTML was for some time popular and it doesn't have any optional tags.
>   So people got into the habit of always using end tags and writing
>   empty tags as <XXX />.
> * Aesthetics: Always writing the end tags is more consistent and may
>   look more balanced.
> * Cargo-cult: People saw other people do that and copied the habit
>   without thinking about it.
>
>
> > Are you saying that it's better to omit them all?
>
> If you want to conserve keystrokes :-)
>
> I think it doesn't matter. Both are valid.
>
> > More importantly: Would you omit all the </p> closing tags you can, or
> > would you include them?
>
> I usually write them.

Interesting. So which of the above reasons is yours? Personally, I do
it for a slightly different reason: Many end tags are *situationally*
optional, and it's much easier to debug code when you
change/insert/remove something and nothing else changes, than when
doing so affects the implicit closing tags.
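
To make "implicit closing tags" concrete, here is a minimal sketch using
the standard library's html.parser. It applies just one of the HTML
rules (an open <p> is closed when another <p> starts) and is nowhere
near a full HTML5 tree builder. The names ParagraphCloser and
tag_events are made up for this illustration.

```python
from html.parser import HTMLParser


class ParagraphCloser(HTMLParser):
    """Record tag events, inserting an implied </p> before a new <p>.

    Only this single rule is handled; a real parser knows many more.
    """

    def __init__(self):
        super().__init__()
        self.events = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.p_open:
                # The previous paragraph never saw an explicit end tag.
                self.events.append("</p> (implied)")
            self.p_open = True
        self.events.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag == "p":
            self.p_open = False
        self.events.append(f"</{tag}>")


def tag_events(markup):
    """Return the list of tag events seen in markup (hypothetical helper)."""
    parser = ParagraphCloser()
    parser.feed(markup)
    parser.close()
    return parser.events


print(tag_events("<p>one<p>two</p>"))
print(tag_events("<p>one</p><p>two</p>"))
```

With explicit end tags the event list never contains an implied
entry, so an edit elsewhere in the markup cannot move where a
paragraph ends.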

> I also indent the contents of an element, so I
> would write your example as:
>
> <!DOCTYPE html>
> <html>
>   <body>
>     Hello, world!
>     <p>
>       Paragraph 2
>     </p>
>     <p>
>       Hey look, a third paragraph!
>     </p>
>   </body>
> </html>
>
> (As you can see I would also include the body tags to make that element
> explicit. I would normally also add a bit of boilerplate (especially a
> head with a charset and viewport definition), but I omit them here since
> they would change the parse tree)
>

Yeah - any REAL page would want quite a bit (very few pages these days
manage without a style sheet, and it seems that hardly any survive
without importing a few gigabytes of JavaScript, but that's not
mandatory), but in ancient pages, there's still a well-defined parse
structure for every tag sequence.

One thing I find quite interesting, though, is the way that browsers
*differ* in the face of bad nesting of tags. Recently I was struggling
to figure out a problem with an HTML form, and eventually found that
there was a spurious <form> tag way up higher in the page. Forms don't
nest, so that's invalid, but different browsers had slightly different
ways of showing it. (Obviously the W3C Validator was the most helpful
tool here, since it reports it as an error rather than constructing
any sort of DOM tree.)
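
For this particular mistake, a quick check can be sketched with the
standard library's html.parser, which only reports tags as they appear
and does not try to repair nesting. This is a rough illustration, not
a substitute for a validator; NestedFormFinder and find_nested_forms
are hypothetical names.

```python
from html.parser import HTMLParser


class NestedFormFinder(HTMLParser):
    """Note the position of every <form> opened inside another <form>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <form> elements currently open
        self.nested_at = []   # (line, column) pairs from getpos()

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            if self.depth > 0:
                self.nested_at.append(self.getpos())
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore stray </form> tags that have no matching start tag.
        if tag == "form" and self.depth > 0:
            self.depth -= 1


def find_nested_forms(markup):
    """Return positions of <form> tags that sit inside an open form."""
    finder = NestedFormFinder()
    finder.feed(markup)
    finder.close()
    return finder.nested_at


page = "<form>\n<p>stray form above</p>\n<form>\n<input>\n</form>\n"
for line, column in find_nested_forms(page):
    print(f"nested <form> at line {line}, column {column}")
```

A spurious <form> that is never closed, as in the sample page, is
still caught because the depth counter stays above zero when the
second <form> arrives.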

ChrisA

