Mutating an HTML file with BeautifulSoup

Barry barry at barrys-emacs.org
Fri Aug 19 15:12:35 EDT 2022



> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
> 
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?

As I recall, bs4 parses the input into an object tree and discards the
formatting details of the original text.
I recently ported some code from a very old bs to bs4 and hit the same issue.
So no, it will not output exactly what went in.
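
Since it is the re-serialising step that loses the detail, one way to keep
the diffs small is to use a parser only to locate the tags and then patch
the original text in place. The sketch below does that with the stdlib
html.parser, using getpos() in much the same way that element.sourceline
and element.sourcepos would be used. The names AnchorLocator and
upgrade_links are made up for illustration, and it assumes double-quoted
href values.

```python
from html.parser import HTMLParser


class AnchorLocator(HTMLParser):
    """Record where each <a ...> start tag sits in the original text."""

    def __init__(self, text):
        super().__init__(convert_charrefs=True)
        # Absolute offset at which each line starts. HTMLParser counts
        # lines by "\n" only, so do the same here.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
        self.spans = []  # (start, end) offsets of every <a ...> start tag
        self.feed(text)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            line, col = self.getpos()  # line is 1-based, col is 0-based
            start = self.line_starts[line - 1] + col
            # get_starttag_text() is the raw source text of the tag.
            self.spans.append((start, start + len(self.get_starttag_text())))


def upgrade_links(text, old="http://example.com/", new="https://example.com/"):
    """Rewrite href prefixes inside <a> start tags; leave all else untouched."""
    # Patch from the end so that earlier offsets stay valid.
    for start, end in reversed(AnchorLocator(text).spans):
        tag = text[start:end]
        # Assumption: href values are double-quoted.
        patched = tag.replace('href="' + old, 'href="' + new)
        text = text[:start] + patched + text[end:]
    return text
```

Everything outside the matched start tags is copied through byte for byte,
so whitespace and attribute order survive.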

If you can trust the input to parse as XML, meaning that all the rules for
closing tags have been followed, then I think you can parse and unparse it
through an XML library to do what you want.
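
A rough sketch of that round trip, assuming the file is well-formed XML and
using the stdlib xml.etree.ElementTree (the function name upgrade_hrefs_xml
is made up for illustration):

```python
import xml.etree.ElementTree as ET


def upgrade_hrefs_xml(text, old="http://example.com/", new="https://example.com/"):
    """Parse as XML, rewrite matching href attributes, serialise again."""
    root = ET.fromstring(text)  # raises ParseError if the input is not well-formed
    for a in root.iter("a"):
        href = a.get("href", "")
        if href.startswith(old):
            a.set("href", new + href[len(old):])
    # Python 3.8+ keeps attributes in their original order when writing.
    return ET.tostring(root, encoding="unicode")
```

Text and the whitespace between elements come back unchanged, but this is
not a byte-exact round trip: by default ElementTree drops comments and the
doctype, and writes empty elements in the short form, so it is worth
diffing a sample of files before trusting it.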

Barry


> 
> Using the Alice example from the BS4 docs:
> 
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> 
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> 
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
> >>>
> 
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
> 
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
> 
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
> 
> ChrisA


