Mutating an HTML file with BeautifulSoup

Buck Evan buck.2019 at gmail.com
Sun Aug 21 20:04:41 EDT 2022


I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched BeautifulSoup for lxml long ago and have never regretted it.
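
For the kind of mutation described in the question quoted below (changing an
anchor's href), a round trip might look roughly like this. It is a minimal
sketch, assuming lxml is installed; the function name `upgrade_links` and the
hard-coded example.com prefix are illustrative only, not taken from any real
project.

```python
import lxml.html


def upgrade_links(html_text: str) -> str:
    """Parse, rewrite http://example.com/ hrefs to https, and serialize again."""
    doc = lxml.html.fromstring(html_text)
    for anchor in doc.iter("a"):
        href = anchor.get("href")
        # Only touch links under the (hypothetical) site being migrated.
        if href and href.startswith("http://example.com/"):
            anchor.set("href", "https://" + href[len("http://"):])
    # encoding="unicode" makes tostring() return str instead of bytes.
    return lxml.html.tostring(doc, encoding="unicode")


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>'
    print(upgrade_links(sample))
```

The serialized output can still differ from the input in places the code never
touched, which is what the no-op round trip suggested next is meant to surface.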

If you find that you have a bunch of invalid HTML that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: perform a
no-op round trip through lxml on all files before making any real changes.
I'd then analyze the resulting diff by progressively excluding each kind of
change via `grep -vP`.
Unless I'm mistaken, all such changes should fall into no more than a dozen
groups.
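
The no-op round trip and the progressive filtering could be sketched in Python
roughly as follows. The helper names (`noop_roundtrip`, `changed_lines`,
`exclude`) are made up for this illustration, and `exclude` is only a rough
stand-in for `grep -vP`, since Python's `re` is not identical to PCRE.

```python
import difflib
import itertools
import re


def noop_roundtrip(html_text: str) -> str:
    """Parse and immediately re-serialize, changing nothing on purpose."""
    import lxml.html  # third-party; imported here so the diff helpers work without it
    doc = lxml.html.document_fromstring(html_text)
    return lxml.html.tostring(doc, encoding="unicode")


def changed_lines(before: str, after: str) -> list[str]:
    """Return only the added/removed lines of a unified diff, no context."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    body = itertools.islice(diff, 2, None)  # skip the '---' / '+++' header pair
    return [line for line in body if line.startswith(("+", "-"))]


def exclude(lines: list[str], pattern: str) -> list[str]:
    """Drop every line matching the pattern, like `grep -v`."""
    rx = re.compile(pattern)
    return [line for line in lines if not rx.search(line)]
```

Typical use would be `remaining = changed_lines(src, noop_roundtrip(src))`,
followed by repeated `remaining = exclude(remaining, ...)` calls, one per group
of changes you have understood, until nothing unexplained is left.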




On Fri, Aug 19, 2022, 1:34 PM Chris Angelico <rosuav at gmail.com> wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
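> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> from bs4 import BeautifulSoup
> >>> soup = BeautifulSoup(html_doc, "html.parser")  # parser not shown in the original; assumed
> >>> print(soup)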
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA