Mutating an HTML file with BeautifulSoup

Fri Aug 19 23:11:44 EDT 2022

On 2022-08-19, Chris Angelico <rosuav at gmail.com> wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
>>>> html_doc = """<html><head><title>The Dormouse's story</title></head>
><body>
><p class="title"><b>The Dormouse's story</b></p>
>
><p class="story">Once upon a time there were three little sisters; and
> their names were
><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
><p class="story">...</p>
> """
>>>> from bs4 import BeautifulSoup
>>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>>> print(soup)
><html><head><title>The Dormouse's story</title></head>
><body>
><p class="title"><b>The Dormouse's story</b></p>
><p class="story">Once upon a time there were three little sisters; and
> their names were
><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
><a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
><p class="story">...</p>
></body></html>
>>>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (e.g. changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).

I'm tempting the Wrath of Zalgo by saying it, but ... regexp?
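
A minimal sketch of the regexp idea, under some loud assumptions: attribute values are quoted, no attribute value inside the tag contains a literal ">", and the anchors you care about are not hiding inside comments or <script> blocks. The names OLD_PREFIX, NEW_PREFIX and upgrade_links are made up for illustration. Only the matched prefix is rewritten, so whitespace and attribute order everywhere else stay byte-for-byte identical.

```python
import re

# Hypothetical prefixes, taken from the example in the question.
OLD_PREFIX = "http://example.com/"
NEW_PREFIX = "https://example.com/"

# Match the start of an <a ...> tag up to and including the opening quote of
# its href value, followed by the old prefix. Requiring whitespace directly
# before "href" keeps attributes such as data-href from matching.
A_HREF = re.compile(
    r"""(<a\b[^>]*?\shref\s*=\s*["'])""" + re.escape(OLD_PREFIX),
    re.IGNORECASE,
)


def upgrade_links(html: str) -> str:
    """Rewrite the href prefix in anchor tags, leaving all other text as-is."""
    # A function replacement avoids any backslash/group-reference surprises
    # in the replacement string.
    return A_HREF.sub(lambda m: m.group(1) + NEW_PREFIX, html)
```

Read each file as text, pass it through upgrade_links(), and write it back only if the result differs; the diff should then contain nothing but the changed URLs. It is not a parser, so anything outside those assumptions (unquoted values, ">" inside an attribute, markup in comments) needs a manual check or the sourceline/sourcepos route.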