Mutating an HTML file with BeautifulSoup

Fri Aug 19 20:10:47 EDT 2022

On Sat, 20 Aug 2022 at 10:04, David <bouncingcats at gmail.com> wrote:
>
> On Sat, 20 Aug 2022 at 04:31, Chris Angelico <rosuav at gmail.com> wrote:
>
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> > Note two distinct changes: firstly, whitespace has been removed, and
> > secondly, attributes are reordered (I think alphabetically). There are
> > other canonicalizations being done, too.
>
> > I'm trying to make some automated changes to a huge number of HTML
> > files, with minimal diffs so they're easy to validate. That means that
> > spurious changes like these are very much unwanted. Is there a way to
> > get BS4 to reconstruct the original precisely?
>
> On Sat, 20 Aug 2022 at 07:02, Chris Angelico <rosuav at gmail.com> wrote:
> > On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
>
> > > I recall that in bs4 it parses into an object tree and loses the detail
> > > of the input.  I recently ported from very old bs to bs4 and hit the
> > > same issue.  So no it will not output the same as went in.
>
> > So I'm left with a few options:
>
> > 1) Give up on validation, give up on verification, and just run this
> >    thing on the production site with my fingers crossed
>
> > 2) Instead of doing an intelligent reconstruction, just str.replace() one
> >    URL with another within the file
>
> > 3) Split the file into lines, find the Nth line (elem.sourceline) and
> >    str.replace that line only
>
> > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start of
> >    the tag, manually find the end, and replace one tag with the
> >    reconstructed form.
>
> > I'm inclined to the first option, honestly. The others just seem like
> > hard work, and I became a programmer so I could be lazy...
>
> Hi, I don't know if you will like this option, but I don't see it on the
> list yet so ...

Hey, all options are welcome :)
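
For reference, option 3 from my list is about as simple as it sounds.
Here is a rough sketch, assuming the line number is 1-based (which is
how elem.sourceline is reported when bs4 is backed by html.parser).
The name replace_on_line is made up for illustration.

```python
def replace_on_line(text, lineno, old, new):
    """Replace `old` with `new` on a single line, leaving every other byte alone.

    `lineno` is assumed to be 1-based, e.g. taken from elem.sourceline.
    """
    # Split on "\n" only, so "\r\n" files keep their "\r" and rejoin exactly.
    lines = text.split("\n")
    target = lines[lineno - 1]
    if old not in target:
        raise ValueError(f"{old!r} not found on line {lineno}")
    lines[lineno - 1] = target.replace(old, new)
    return "\n".join(lines)
```

This still touches every occurrence on that one line, so it only helps
when the URL does not appear twice on the same line.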

> I'm assuming that the phrase "with minimal diffs so they're easy to
> validate" means being eyeballed by a human.
>
> Have you considered two passes through BS? Do the first pass with no
> modification, so that the intermediate result gets the BS default
> "spurious" changes.
>
> Then do the second pass with the desired changes, so that the human will
> see only the desired changes in the diff.
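
To make sure I understand the two-pass idea, here is a rough sketch of
it. The function name and the two callables are hypothetical. With bs4,
normalize could be something like
lambda s: str(BeautifulSoup(s, "html.parser")), and modify would parse,
change, and re-serialize the same way. It also assumes that a second
round trip through BS leaves already-normalized output unchanged, which
I have not verified.

```python
import difflib


def two_pass_diff(original, normalize, modify, name="page.html"):
    """Diff the modified page against a normalized baseline, not the raw file.

    normalize: str -> str, a no-op round trip through the parser (pass 1)
    modify:    str -> str, the round trip plus the real change (pass 2)
    """
    baseline = normalize(original)   # pass 1: absorbs the "spurious" changes
    changed = modify(baseline)       # pass 2: only the intended change remains
    diff = difflib.unified_diff(
        baseline.splitlines(keepends=True),
        changed.splitlines(keepends=True),
        fromfile=name + " (normalized)",
        tofile=name + " (changed)",
    )
    return "".join(diff)
```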

I'm 100% confident of the intended changes, so that wouldn't really
solve anything. The problem is that, without eyeballing the full
diff, I can't easily see whether something else has been changed or
broken. This is a scripted change that will affect probably hundreds
of HTML files across a large web site, so making sure I don't break
anything means either (a) minimize the diff so it's clearly correct,
or (b) eyeball the rendered versions of every page - manually - to see
if there were any unintended changes. (There WILL be intended visual
changes, so I can't render the page to bitmap and ensure that it
hasn't changed. This is not React snapshot testing, which IMO is one
of the most useless testing features ever devised. No, actually, that
can't be true, someone MUST have made a worse one.)
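
For option (a), a variant of option 4 from my list can be sketched with
the standard library alone. This is not BS4: it swaps in
html.parser.HTMLParser directly, using getpos() to locate each start
tag and get_starttag_text() to recover its exact original text. Rather
than reconstructing the tag, it edits the raw tag text in place, so
whitespace and attribute order are untouched. StartTagFinder and
rewrite_tags are names invented for this sketch.

```python
from html.parser import HTMLParser


class StartTagFinder(HTMLParser):
    """Collect (line, column, raw_text) for every start tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name      # must be lowercase; the parser lowercases names
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            line, col = self.getpos()  # line is 1-based, col is 0-based
            self.found.append((line, col, self.get_starttag_text()))


def rewrite_tags(source, tag_name, edit):
    """Apply edit(raw_tag_text) to each matching start tag; change nothing else."""
    finder = StartTagFinder(tag_name)
    finder.feed(source)
    finder.close()

    # Absolute offset of the first character of each line.
    line_starts = [0]
    for i, ch in enumerate(source):
        if ch == "\n":
            line_starts.append(i + 1)

    out = source
    # Work backwards so earlier offsets stay valid after each splice.
    for line, col, raw in reversed(finder.found):
        begin = line_starts[line - 1] + col
        end = begin + len(raw)
        if source[begin:end] != raw:
            raise ValueError(f"tag position mismatch at line {line}")
        out = out[:begin] + edit(raw) + out[end:]
    return out
```

Usage would be along the lines of
rewrite_tags(text, "a", lambda raw: raw.replace(OLD_URL, NEW_URL)),
which leaves the same URL alone wherever it appears in body text.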

Appreciate the suggestion, though!

ChrisA