Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Fri Aug 19 14:30:17 EDT 2022


What's the best way to precisely reconstruct an HTML file after
parsing it with BeautifulSoup?

Using the Alice example from the BS4 docs:

>>> html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and
their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
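>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")  # parser as in the BS4 docs example
>>> print(soup)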
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and
their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>>>

Note two distinct changes: firstly, the blank lines between paragraphs
have been removed, and secondly, attributes are reordered (I think
alphabetically). There are other canonicalizations being done, too,
such as the closing </body></html> tags being added.

I'm trying to make some automated changes to a huge number of HTML
files, with minimal diffs so they're easy to validate. That means that
spurious changes like these are very much unwanted. Is there a way to
get BS4 to reconstruct the original precisely?
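
One way to see how much noise a plain parse-and-serialise pass would add
is to diff each file against its own round trip before changing anything.
The following is only a rough sketch: roundtrip_diff is a made-up helper
name, and the choice of html.parser is an assumption (other parsers may
canonicalize differently).

```python
import difflib

from bs4 import BeautifulSoup


def roundtrip_diff(text, parser="html.parser"):
    """Return unified-diff lines between text and its parse-then-serialise form.

    An empty list means the round trip reproduced the input exactly.
    """
    rebuilt = str(BeautifulSoup(text, parser))
    return list(
        difflib.unified_diff(
            text.splitlines(keepends=True),
            rebuilt.splitlines(keepends=True),
            fromfile="original",
            tofile="roundtrip",
        )
    )
```

Any file for which this returns a non-empty list would pick up spurious
changes from serialisation alone, independent of the intended mutation.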

The mutation itself would be things like finding an anchor tag and
changing its href attribute. These are fairly simple changes, but they
might alter the length of the file (e.g. changing "http://example.com/"
into "https://example.com/"). I'd like to do them intelligently rather
than falling back on element.sourceline and element.sourcepos, but
worst case, that's what I'll have to do (which would be fiddly).
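
For reference, the worst-case fallback might look roughly like the sketch
below: BS4 is used only to decide which anchors to touch, and the raw text
is patched in place at the position reported by sourceline/sourcepos. This
assumes the html.parser builder (which reports a 1-based line and a 0-based
column for each start tag), quoted href values, and no ">" inside attribute
values. rewrite_hrefs and its parameters are hypothetical names, not part
of BS4.

```python
import re

from bs4 import BeautifulSoup

# Simplification: a start tag ends at the first ">" after "<a".
START_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)


def rewrite_hrefs(text, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in matching <a href> values only."""
    soup = BeautifulSoup(text, "html.parser")
    # Offset of the first character of each line; html.parser counts "\n" only.
    line_starts = [0] + [m.end() for m in re.finditer("\n", text)]
    href = re.compile(r"""(\shref\s*=\s*["'])""" + re.escape(old_prefix))

    edits = []
    for tag in soup.find_all("a", href=True):
        if not tag["href"].startswith(old_prefix):
            continue
        # Convert (1-based line, 0-based column) to an absolute offset.
        start = line_starts[tag.sourceline - 1] + tag.sourcepos
        m = START_TAG.match(text, start)
        if m is None:
            continue  # source does not look as assumed; leave it alone
        patched = href.sub(lambda h: h.group(1) + new_prefix, m.group(0), count=1)
        edits.append((m.start(), m.end(), patched))

    # Apply from the end backwards so earlier offsets stay valid.
    for begin, end, patched in reversed(edits):
        text = text[:begin] + patched + text[end:]
    return text
```

Everything outside the patched start tags is copied through byte for byte,
so the resulting diff should contain only the href changes.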

ChrisA

