Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Fri Aug 19 20:38:07 EDT 2022


On Sat, 20 Aug 2022 at 10:19, dn <PythonList at danceswithmice.info> wrote:
>
> On 20/08/2022 09.01, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
> >>
> >>
> >>
> >>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that bs4 parses into an object tree and loses the detail of the input.
> >> I recently ported from a very old bs to bs4 and hit the same issue.
> >> So no, it will not output the same as what went in.
> >>
> >> If you can trust the input to be parsed as XML, meaning all the rules of closing
> >> tags have been followed, then I think you can parse and unparse through XML to
> >> do what you want.
> >>
> >
> >
> > Yeah, no, I can't; this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
> > 2) Instead of doing an intelligent reconstruction, just str.replace()
> > one URL with another within the file
> > 3) Split the file into lines, find the Nth line (elem.sourceline) and
> > str.replace that line only
> > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> > of the tag, manually find the end, and replace one tag with the
> > reconstructed form.
> >
> > I'm inclined to the first option, honestly. The others just seem like
> > hard work, and I became a programmer so I could be lazy...
> +1 - but I've noticed that sometimes I have to work quite hard to be
> this lazy!

Yeah, that's very true...

> Am assuming that http -> https is not the only 'change' (if it were,
> you'd just do that without BS). How many such changes are planned/need
> checking? Care to list them?
>

That assumption is correct. The changes are more of the form "find all
the problems, add them to the list of fixes, try to minimize the ones
that need to be done manually". So far, what I have is (with a rough
code sketch after the list):

1) A bunch of http -> https, but not all of them; only for domains
where I've confirmed that HTTPS works
2) Some absolute to relative conversions:
https://www.gsarchive.net/whowaswho/index.htm should be referred to as
/whowaswho/index.htm instead
3) A few outdated URLs for which we know the replacement, eg
http://www.cris.com/~oakapple/gasdisc/<anything> to
http://www.gasdisc.oakapplepress.com/<anything> (this one can't go on
HTTPS, which is one reason I can't shortcut that)
4) Some internal broken links where the path is wrong - anything that
resolves to /books/<anything> but can't be found might be better
rewritten as /html/perf_grps/websites/<anything> if the file can be
found there
5) Any external link that yields a permanent redirect should, to save
client-side requests, be replaced by its destination. We have some
Creative Commons badges that have moved to new URLs.
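
As a rough sketch of what I mean (the file name, and names like
HTTPS_OK, PREFIX_FIXES and rewrite_href, are made up for illustration;
this only covers items 1-3, and only plain href attributes):

from urllib.parse import urlsplit
from bs4 import BeautifulSoup

# Domains confirmed to serve HTTPS (item 1) - placeholder entry only
HTTPS_OK = {"www.example.org"}

# Known outdated URL prefixes and their replacements (item 3)
PREFIX_FIXES = {
    "http://www.cris.com/~oakapple/gasdisc/":
        "http://www.gasdisc.oakapplepress.com/",
}

def rewrite_href(href):
    """Return the fixed URL, or None if no change is needed."""
    for old, new in PREFIX_FIXES.items():
        if href.startswith(old):
            return new + href[len(old):]
    parts = urlsplit(href)
    if parts.scheme == "http" and parts.netloc in HTTPS_OK:
        # http -> https, but only for domains on the confirmed list
        return "https://" + href[len("http://"):]
    if parts.netloc == "www.gsarchive.net":
        # Absolute link to our own site -> site-relative path (item 2)
        return parts.path or "/"
    return None

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

for a in soup.find_all("a", href=True):
    new = rewrite_href(a["href"])
    if new is not None:
        a["href"] = new

Items 4 and 5 would need a filesystem check and an HTTP request
respectively, so they'd just be extra branches in rewrite_href - and of
course this still leaves the question of how to write the result back
out faithfully, which is the whole problem.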

And there'll be other fixes to be done too. So it's a bit complicated,
and no simple solution is really sufficient. At the very very least, I
*need* to properly parse with BS4; the only question is whether I
reconstruct from the parse tree, or go back to the raw file and try to
edit it there.
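
If I do go back to the raw file, option 3 from my earlier list would
look something like this - again just a sketch, reusing the
hypothetical rewrite_href from above and assuming the html.parser
builder, where Tag.sourceline counts from 1:

from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    raw = f.read()

lines = raw.split("\n")
soup = BeautifulSoup(raw, "html.parser")

for a in soup.find_all("a", href=True):
    new = rewrite_href(a["href"])
    if new is not None and a.sourceline:
        idx = a.sourceline - 1          # sourceline counts from 1
        lines[idx] = lines[idx].replace(a["href"], new, 1)

with open("page.html", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

The catch is that a["href"] is the *parsed* attribute value, so if the
original markup has entities or odd quoting in the URL, the str.replace
may not find it - which is exactly the sort of inconsistency this site
is full of.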

For the record, I have very long-term plans to migrate parts of the
site to Markdown, which would make a lot of things easier. But for
now, I need to fix the existing problems in the existing HTML files,
without doing gigantic wholesale layout changes.

ChrisA

