Mutating an HTML file with BeautifulSoup

2QdxY4RzWzUUiLuE at potatochowder.com
Fri Aug 19 16:05:14 EDT 2022


On 2022-08-19 at 20:12:35 +0100,
Barry <barry at barrys-emacs.org> wrote:

> > On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
> > 
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
> 
> I recall that bs4 parses into an object tree and loses the detail of
> the input.  I recently ported from a very old bs to bs4 and hit the
> same issue.  So no, it will not output the same as went in.
> 
> If you can trust the input to be parsed as XML, meaning all the rules
> for closing tags have been followed, then I think you can parse and
> unparse through XML to do what you want.

XML is in the same boat.  Except for "canonical form" (which underlies
cryptographically signed XML documents), the standards explicitly don't
require tools to round-trip the "source code."  The preferred method of
comparing XML documents is at the structural level rather than by their
textual representations.  That way, the following two elements are the
same, because attribute order is not significant (whether a collection
of sub-elements in a different order also counts as the same depends on
the application):

    <e a="b" c="d"/>

and

    <e c="d" a="b"/>
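
A minimal sketch of such a structural comparison, assuming Python's
standard xml.etree.ElementTree module.  The helper name same_structure
is hypothetical, and treating child order as significant while
ignoring surrounding whitespace and tail text is an assumption of this
sketch, not something the standards dictate.  The last lines show that
canonical form (via ET.canonicalize, Python 3.8+) also maps both
spellings to one serialization:

```python
import xml.etree.ElementTree as ET

def same_structure(a, b):
    """Compare two elements structurally, ignoring attribute order."""
    return (
        a.tag == b.tag
        # attrib is a dict, so equality ignores the order of attributes
        and a.attrib == b.attrib
        # assumption: leading/trailing whitespace in text is not significant
        and (a.text or "").strip() == (b.text or "").strip()
        # assumption: children must match pairwise, in document order
        and len(a) == len(b)
        and all(same_structure(x, y) for x, y in zip(a, b))
    )

first = ET.fromstring('<e a="b" c="d"/>')
second = ET.fromstring('<e c="d" a="b"/>')
print(same_structure(first, second))

# Canonical form gives both spellings the same textual representation.
canon_first = ET.canonicalize('<e a="b" c="d"/>')
canon_second = ET.canonicalize('<e c="d" a="b"/>')
print(canon_first == canon_second)
```

If sub-element order should not matter for a given document type, the
children could be sorted by some application-specific key before the
pairwise comparison.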

Dan

