Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Sat Aug 20 21:00:33 EDT 2022


On Sun, 21 Aug 2022 at 09:48, dn <PythonList at danceswithmice.info> wrote:
>
> On 20/08/2022 12.38, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 10:19, dn <PythonList at danceswithmice.info> wrote:
> >> On 20/08/2022 09.01, Chris Angelico wrote:
> >>> On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
> >>>>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
> >>>>>
> >>>>> What's the best way to precisely reconstruct an HTML file after
> >>>>> parsing it with BeautifulSoup?
> ...
>
> >>> well. Thanks for trying, anyhow.
> >>>
> >>> So I'm left with a few options:
> >>>
> >>> 1) Give up on validation, give up on verification, and just run this
> >>> thing on the production site with my fingers crossed
> >>> 2) Instead of doing an intelligent reconstruction, just str.replace()
> >>> one URL with another within the file
> >>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> >>> str.replace that line only
> >>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> >>> of the tag, manually find the end, and replace one tag with the
> >>> reconstructed form.
> >>>
> >>> I'm inclined to the first option, honestly. The others just seem like
> >>> hard work, and I became a programmer so I could be lazy...
> >> +1 - but I've noticed that sometimes I have to work quite hard to be
> >> this lazy!
> >
> > Yeah, that's very true...
> >
> >> Am assuming that http -> https is not the only 'change' (if it were,
> >> you'd just do that without BS). How many such changes are planned/need
> >> checking? Care to list them?
>
> This project has many of the same 'smells' as a database-harmonisation
> effort. Particularly one where 'the previous guy' used to use field-X
> for certain data, but his replacement decided that field-Y 'sounded
> better' (or some such user-logic). Arrrggghhhh!
>
> If you like head-aches, and users coming to you with ifs-buts-and-maybes
> AFTER you've 'done stuff', this is your sort of project!

Well, I don't like headaches, but I do appreciate what the G&S Archive
has given me over the years, so I'm taking this on as a means of
giving back to the community.

> > Assumption is correct. The changes are more of the form "find all the
> > problems, add to the list of fixes, try to minimize the ones that need
> > to be done manually". So far, what I have is:
>
> Having taken the trouble to identify this list of improvements and given
> the determination to verify each, consider working through one item at a
> time, rather than in a single pass. This will enable individual logging
> of changes, a manual check of each alteration, and the ability to
> choose/tailor the best tool for that specific task.
>
> In fact, depending upon frequency, it may be worth making the changes
> manually (and with improved confidence in the result).

Unfortunately the frequency is very high.

> The presence of (or allusion to) the word "some" in these list-items is
> 'the killer'. Automation doesn't like 'some' (cf "all") unless the
> criteria can be clearly and unambiguously defined. Ouch!
>
> (I don't think you need to be told any of this, but hey: dreams are free!)

Right; the criteria are quite well defined, but I omitted the details
for brevity.

> > 1) A bunch of http -> https, but not all of them - only domains where
> > I've confirmed that it's valid
>
> The search-criterion is the list of valid domains, rather than the
> "http/https" which is likely the first focus.

Yeah. I do a first pass to enumerate all domains that are ever linked
to with http:// URLs, and then I have a script that goes through and
checks whether each one redirects me to the same URL on the other
protocol (among a few other checks). So yes, the list of valid domains
is part of the program's effective input.
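
For the curious, the probe is roughly this - a trimmed-down sketch, not
the actual script, with http_domains standing in for the set collected
in the first pass:

import requests
from urllib.parse import urlparse

def upgradable(domain):
    """True if plain-http links to this domain look safe to rewrite to
    https, i.e. the server itself redirects us there."""
    try:
        resp = requests.get(f"http://{domain}/", timeout=10)
    except requests.RequestException:
        return False
    final = urlparse(resp.url)
    # Landed on https for (effectively) the same host? Good enough.
    host = (final.hostname or "").removeprefix("www.")
    return final.scheme == "https" and host == domain.removeprefix("www.")

# https_safe = {d for d in http_domains if upgradable(d)}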

> > 2) Some absolute to relative conversions:
> > https://www.gsarchive.net/whowaswho/index.htm should be referred to as
> > /whowaswho/index.htm instead
>
> Similarly, if you have a list of these.

It's more just the pattern "https://www.gsarchive.net/<anything>" and
"https://gsarchive.net/<anything>", and the corresponding "http://"
URLs, plus a few other malformed versions that are worth correcting
(if ever I find a link to "www.gsarchive.net/<anything>", it's almost
certainly missing its protocol).
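
The normalisation itself is nothing fancy; something along these lines
(sketch only, and OWN_HOSTS is just illustrative):

from urllib.parse import urlsplit, urlunsplit

OWN_HOSTS = {"gsarchive.net", "www.gsarchive.net"}

def make_relative(url):
    """Turn absolute links to our own site into root-relative paths;
    protocol-less 'www.gsarchive.net/...' forms get the same treatment."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and parts.netloc.lower() in OWN_HOSTS:
        return urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))
    if not parts.scheme and url.lower().startswith(("gsarchive.net/", "www.gsarchive.net/")):
        return "/" + url.split("/", 1)[1]
    return url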

> > 3) A few outdated URLs for which we know the replacement, eg
> > http://www.cris.com/~oakapple/gasdisc/<anything> to
> > http://www.gasdisc.oakapplepress.com/<anything> (this one can't go on
> > HTTPS, which is one reason I can't shortcut that)
>
> Again.

Same; although those are manually entered as patterns.
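
The table is literally just prefix pairs, applied in order; along these
lines (example entries, not the real list):

# Known-outdated prefixes and their replacements, checked in order, so
# the more specific rules go first. Example entries only.
PREFIX_FIXES = [
    ("http://www.cris.com/~oakapple/gasdisc/",
     "http://www.gasdisc.oakapplepress.com/"),
]

def apply_prefix_fixes(url):
    for old, new in PREFIX_FIXES:
        if url.startswith(old):
            return new + url[len(old):]
    return url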

> > 4) Some internal broken links where the path is wrong - anything that
> > resolves to /books/<anything> but can't be found might be better
> > rewritten as /html/perf_grps/websites/<anything> if the file can be
> > found there
>
> Again.

The fixups are manually entered, but I also need to know about every
broken internal link so that I can look through them and figure out
what's wrong.
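
The /books/ fallback is about the only part of that which can be
automated. Roughly this, with ROOT being a local working copy of the
site (names here are hypothetical):

from pathlib import Path

ROOT = Path("/path/to/site")  # local working copy (placeholder path)

def fix_internal(path):
    """For a root-relative link, return a corrected path if the target
    exists elsewhere, or None to flag it for manual review."""
    rel = path.lstrip("/")
    if (ROOT / rel).exists():
        return path  # nothing wrong with this one
    if rel.startswith("books/"):
        candidate = "html/perf_grps/websites/" + rel[len("books/"):]
        if (ROOT / candidate).exists():
            return "/" + candidate
    return None  # broken and no known fixup - log it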

> > 5) Any external link that yields a permanent redirect should, to save
> > clientside requests, get replaced by the destination. We have some
> > Creative Commons badges that have moved to new URLs.
>
> Do you have these as a list, or are you intending the automated-method
> to auto-magically follow the link to determine any need for action?

The same script that checks for http->https conversion probes all
links and checks to see if (a) it returns a perm redirect, or (b) it
returns an error. Fix the first group, log the second, leave anything
else alone.
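
The classification is just a request with redirects disabled and a look
at the status code; a cut-down sketch of the idea:

import requests
from urllib.parse import urljoin

def probe(url):
    """Classify an external link: ('redirect', target), ('error', detail),
    or ('ok', None)."""
    try:
        resp = requests.get(url, timeout=15, allow_redirects=False)
    except requests.RequestException as e:
        return ("error", str(e))
    if resp.status_code in (301, 308):  # permanent redirects only
        return ("redirect", urljoin(url, resp.headers.get("Location", "")))
    if resp.status_code >= 400:
        return ("error", resp.status_code)
    return ("ok", None)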

> > And there'll be other fixes to be done too. So it's a bit complicated,
> > and no simple solution is really sufficient. At the very very least, I
> > *need* to properly parse with BS4; the only question is whether I
> > reconstruct from the parse tree, or go back to the raw file and try to
> > edit it there.
>
> At least the diffs would give you something to work-from, but it's a bit
> like git-diffs claiming a 'change' when the only difference is that my
> IDE strips blanks from the ends of code-lines, or some-such silliness.

Right; and the reconstructed version has a LOT of those unnecessary
changes, mostly to whitespace. The only question is whether I can be
confident that none of those changes could ever matter.

> Which brings me to ask: why "*need* to properly parse with BS4"?

Well, there's a *need to properly parse*, because I don't want to
summon "the One whose Name cannot be expressed in the Basic
Multilingual Plane" by using regular expressions on HTML. Am open to
other suggestions; BS4 is the single most obvious one, but by no means
the only way to do things.
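
To be clear, the BS4 side of it is trivial; it's the writing-back that
hurts. A minimal sketch, with fix_url standing in for all the rules
above:

from bs4 import BeautifulSoup

def rewrite_file(fn, fix_url):
    """Parse one page, run every href through fix_url, write back only
    if anything changed."""
    with open(fn, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    changed = False
    for tag in soup.find_all(href=True):
        new = fix_url(tag["href"])
        if new != tag["href"]:
            tag["href"] = new
            changed = True
    if changed:
        with open(fn, "w", encoding="utf-8") as f:
            # writing str(soup) back is exactly where the
            # canonicalisation differences creep in
            f.write(str(soup))
    return changed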

> What about selective use of tools, previously-mentioned in this thread?

I've answered the option of regular expressions; did I miss any other
HTML-aware tools being mentioned? If so, my apologies, and feel free
to remind me.

> Is Selenium worthy of consideration?

Yes..... but I don't know how much it would buy me. It certainly has
no options for editing back the original HTML, so all it would do is
the parsing side of things (which is already working fine).

> I'm assuming you've already been using a link-checker utility to locate
> the links which need to be changed. They can be used in QA-mode
> after-the-fact too.

I actually haven't, but only because I figured that the autofixer
would do the same job as the link-checker. Or rather, I wrote my own
link-checker because I needed it to do more. And again, most standard
utilities merely list the problems; they don't have a way to fix them.

> > For the record, I have very long-term plans to migrate parts of the
> > site to Markdown, which would make a lot of things easier. But for
> > now, I need to fix the existing problems in the existing HTML files,
> > without doing gigantic wholesale layout changes.
>
> ...and there's another option. If the Markdown conversion is done first,
> it will obviate any option of diffs completely. However, it will
> introduce a veritable cornucopia of opportunity for this and 'other
> stuff' to go-wrong, bringing us back to a page-by-page check or
> broad-checks only, and an appeal to readers to report problems.

Yeah, and the fundamental problem with the MD conversion is time -
it's a big manual process. I want to be able to do that progressively
over time, but get the basic stuff sorted out much sooner. Ideally, it
should be possible to fix all the autofixable links this week and get
that sorted out, but converting pages to Markdown will happen slowly
over the next few years.

> The (PM-oriented) observation is that if you are baulking at the amount
> of work 'now', you'll be equally dismayed by the consequences of a
> subsequent 'Markdown project'!

Nah, there's no rush on it, and I know from experience how much
benefit it can give :)

> Perhaps, therefore, some counter-intuitive logic, eg combining the
> two/biting two bullets/recognising that many of the risks and likelihoods of
> error overlap (rather than add/multiply).

That's true, and for new pages, it's way easier to handle (for
instance, this page https://gsarchive.net/html/dixon.html did not
exist prior to my curatorship - for obvious reasons - and I created it
as a Markdown file).

> 'Bit rot' is so common in today's world, do readers treat such
> pages/sites particularly differently?

That's what I am unsure of, and why I would prefer to make as few
unnecessary changes as possible. However, I am leaning more and more
strongly towards "just let BS4 do its canonicalization", given that
all the alternatives posted here have been worse.

> Somewhat conversely, even in our 'release-often, break-early' world, do
> users often exert themselves to provide constructive feedback, eg 'link
> broken'?

Maybe? But there are always pages that only a few people ever look at
(this is a vast archive and some of its content is *extremely* niche),
so I would prefer to preempt the issues.

Appreciate the thoughts.

ChrisA

