Mutating an HTML file with BeautifulSoup

dn PythonList at DancesWithMice.info
Sat Aug 20 23:39:33 EDT 2022


On 21/08/2022 13.00, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:48, dn <PythonList at danceswithmice.info> wrote:
>> On 20/08/2022 12.38, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 10:19, dn <PythonList at danceswithmice.info> wrote:
>>>> On 20/08/2022 09.01, Chris Angelico wrote:
>>>>> On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
>>>>>>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:

>>>>> So I'm left with a few options:
>>>>>
>>>>> 1) Give up on validation, give up on verification, and just run this
>>>>> thing on the production site with my fingers crossed
>>>>> 2) Instead of doing an intelligent reconstruction, just str.replace()
>>>>> one URL with another within the file
>>>>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
>>>>> str.replace that line only
>>>>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
>>>>> of the tag, manually find the end, and replace one tag with the
>>>>> reconstructed form.
>>>>>
>>>>> I'm inclined to the first option, honestly. The others just seem like
>>>>> hard work, and I became a programmer so I could be lazy...
>>>> +1 - but I've noticed that sometimes I have to work quite hard to be
>>>> this lazy!
>>>
>>> Yeah, that's very true...
>>>
>>>> Am assuming that http -> https is not the only 'change' (if it were,
>>>> you'd just do that without BS). How many such changes are planned/need
>>>> checking? Care to list them?
>>
>> This project has many of the same 'smells' as a database-harmonisation
>> effort. Particularly one where 'the previous guy' used to use field-X
>> for certain data, but his replacement decided that field-Y 'sounded
>> better' (or some such user-logic). Arrrggghhhh!
>>
>> If you like headaches, and users coming to you with ifs-buts-and-maybes
>> AFTER you've 'done stuff', this is your sort of project!
> 
> Well, I don't like headaches, but I do appreciate what the G&S Archive
> has given me over the years, so I'm taking this on as a means of
> giving back to the community.

This point will be picked up in the conclusion. NB: in the same way that
you want to 'give back', so also do others - even if in minor ways or
'when-relevant'!


>>> Assumption is correct. The changes are more of the form "find all the
>>> problems, add to the list of fixes, try to minimize the ones that need
>>> to be done manually". So far, what I have is:
>>
>> Having taken the trouble to identify this list of improvements and given
>> the determination to verify each, consider working through one item at a
>> time, rather than in a single pass. This will enable individual logging
>> of changes, a manual check of each alteration, and the ability to
>> choose/tailor the best tool for that specific task.
>>
>> In fact, depending upon frequency, consider making the changes manually
>> (and with improved confidence in the result).
> 
> Unfortunately the frequency is very high.

Screechingly so? Like you're singing Three Little Maids?


>> The presence of (or allusion to) the word "some" in these list-items is
>> 'the killer'. Automation doesn't like 'some' (cf "all") unless the
>> criteria can be clearly and unambiguously defined. Ouch!
>>
>> (I don't think you need to be told any of this, but hey: dreams are free!)
> 
> Right; the criteria are quite well defined, but I omitted the details
> for brevity.
> 
>>> 1) A bunch of http -> https, but not all of them - only domains where
>>> I've confirmed that it's valid
>>
>> The search-criterion is the list of valid domains, rather than the
>> "http/https" which is likely the first focus.
> 
> Yeah. I do a first pass to enumerate all domains that are ever linked
> to with http:// URLs, and then I have a script that goes through and
> checks to see if they redirect me to the same URL on the other
> protocol, or other ways of checking. So yes, the list of valid domains
> is part of the program's effective input.

Wow! Having got that far, you have achieved data-validity. Is there any
need to perform a before-and-after check or diff?

Perhaps start making the one-for-one replacements without further
anxiety. As long as there's no silly mistake, eg failing to remove an
opening or closing angle-bracket, isn't that about all the checking
needed (for this category of updates)?
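
By way of illustration, a minimal sketch of that sort of probe -
assuming the third-party 'requests' package, with hypothetical domain
names as input:

    import requests

    def https_ok(domain):
        """True if the domain answers happily over https."""
        try:
            resp = requests.head("https://" + domain + "/",
                                 timeout=10, allow_redirects=True)
            return resp.status_code < 400
        except requests.RequestException:
            return False

    # hypothetical input: domains currently linked-to via http://
    http_domains = ["example.com", "example.org"]
    upgradeable = [d for d in http_domains if https_ok(d)]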


>>> 2) Some absolute to relative conversions:
>>> https://www.gsarchive.net/whowaswho/index.htm should be referred to as
>>> /whowaswho/index.htm instead
>>
>> Similarly, if you have a list of these.
> 
> It's more just the pattern "https://www.gsarchive.net/<anything>" and
> "https://gsarchive.net/<anything>", and the corresponding "http://"
> URLs, plus a few other malformed versions that are worth correcting
> (if ever I find a link to "www.gsarchive.net/<anything>", it's almost
> certainly missing its protocol).

Isn't the inspection tool (described elsewhere) reporting an HTML/editor
line number?

That being the case, won't a bit of Swiss-Army-knife Python string-work
enable appropriate processing and re-writing - as well as providing the
means to statistically sample for QA?
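
A sketch of that string-level approach - assuming BS4's 'html.parser'
back-end (whose tag.sourceline counts from 1), and that the old URL
appears only once on its line:

    from bs4 import BeautifulSoup

    def fix_one_link(path, old_url, new_url):
        text = open(path, encoding="utf-8").read()
        soup = BeautifulSoup(text, "html.parser")
        lines = text.splitlines(keepends=True)
        for anchor in soup.find_all("a", href=old_url):
            n = anchor.sourceline - 1        # sourceline is 1-based
            lines[n] = lines[n].replace(old_url, new_url)
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(lines)

Only the affected lines are touched, so the rest of the file survives
byte-for-byte - which keeps any diff small enough to eyeball.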


>>> 3) A few outdated URLs for which we know the replacement, eg
>>> http://www.cris.com/~oakapple/gasdisc/<anything> to
>>> http://www.gasdisc.oakapplepress.com/<anything> (this one can't go on
>>> HTTPS, which is one reason I can't shortcut that)
>>
>> Again.
> 
> Same; although those are manually entered as patterns.

Ah, "manual". The trusty IDE/editor as your friend...

Although, if the same pattern appears in multiple places, maybe it would
be possible to create a template-solution, and then apply it to specific
filename and line-number combinations?
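
Something like the following, using the oakapple example from your list,
above:

    # a 'template-solution': a table of known prefix-rewrites,
    # applied to each file's text in turn
    REWRITES = {
        "http://www.cris.com/~oakapple/gasdisc/":
            "http://www.gasdisc.oakapplepress.com/",
        # ...further manually-entered patterns...
    }

    def apply_rewrites(text):
        for old, new in REWRITES.items():
            text = text.replace(old, new)
        return text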


BTW, in talk of "line-numbers": you will have realised the need to
re-run the identification step after each of these passes - in case the
'new stuff' from an earlier step (assuming the above also becomes a
temporal sequence) is shorter or longer than the current HTML, shifting
everything below it.


>>> 4) Some internal broken links where the path is wrong - anything that
>>> resolves to /books/<anything> but can't be found might be better
>>> rewritten as /html/perf_grps/websites/<anything> if the file can be
>>> found there
>>
>> Again.
> 
> The fixups are manually entered, but I also need to know about every
> broken internal link so that I can look through them and figure out
> what's wrong.

OK, so it is assumed that the link-checker tool is performing the
problem identification/location - and that's as far as anyone, or any
computer, could hope to go.
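
If the document-tree is also available on local disk, the "might be
better rewritten as" rule could be sketched like so ('SITE_ROOT' being
a hypothetical local copy of the web-root):

    from pathlib import Path

    SITE_ROOT = Path("/path/to/local/webroot")   # hypothetical

    def fix_books_link(href):
        rel = href[len("/books/"):]       # assumes a "/books/..." href
        if (SITE_ROOT / "books" / rel).exists():
            return href                   # not actually broken
        candidate = SITE_ROOT / "html/perf_grps/websites" / rel
        if candidate.exists():
            return "/html/perf_grps/websites/" + rel
        print("still broken:", href)      # needs a manual look
        return href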


>>> 5) Any external link that yields a permanent redirect should, to save
>>> clientside requests, get replaced by the destination. We have some
>>> Creative Commons badges that have moved to new URLs.
>>
>> Do you have these as a list, or are you intending the automated-method
>> to auto-magically follow the link to determine any need for action?
> 
> The same script that checks for http->https conversion probes all
> links and checks to see if (a) it returns a perm redirect, or (b) it
> returns an error. Fix the first group, log the second, leave anything
> else alone.

QED!
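
For the record, a sketch of that triage - again assuming 'requests':

    import requests

    def triage(url):
        """Classify a link: replace it, log it, or leave it alone."""
        try:
            resp = requests.head(url, timeout=10, allow_redirects=False)
        except requests.RequestException as err:
            return ("log", str(err))
        if resp.status_code in (301, 308):   # permanent redirects
            return ("replace", resp.headers.get("Location", url))
        if resp.status_code >= 400:
            return ("log", resp.status_code)
        return ("leave", url)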


>>> And there'll be other fixes to be done too. So it's a bit complicated,
>>> and no simple solution is really sufficient. At the very very least, I
>>> *need* to properly parse with BS4; the only question is whether I
>>> reconstruct from the parse tree, or go back to the raw file and try to
>>> edit it there.
>>
>> At least the diffs would give you something to work-from, but it's a bit
>> like git-diffs claiming a 'change' when the only difference is that my
>> IDE strips blanks from the ends of code-lines, or some-such silliness.
> 
> Right; and the reconstructed version has a LOT of those unnecessary
> changes. I'm seeing a lot of changes to whitespace. The only problem
> is whether I can be confident that none of those changes could ever
> matter.

"White-space" has lesser-meaning in HTML - this is NOT Python! In HTML
if I write "HTML  file" (with two spaces), the browser will shorten the
display to a single space (hence some uses of &nbsp; - the non-breaking
space). Similarly, if one attempts to use "\n" to start a new line of text...

Is there a danger of 'chasing your own tail', ie seeking a solution to a
problem which really doesn't matter (particularly if we add the phrase:
at the user-level)?

Which brings up the question of the efficacy of diffs cf the efficacy
of BS4's contributions?

(see later: statistical QA cf 100% confirmations)


>> Which brings me to ask: why "*need* to properly parse with BS4"?
> 
> Well, there's a *need to properly parse*, because I don't want to
> summon "the One whose Name cannot be expressed in the Basic
> Multilingual Plane" by using regular expressions on HTML. Am open to
> other suggestions; BS4 is the single most obvious one, but by no means
> the only way to do things.

Agree with "properly parse". The question was about an apparent
dedication to BS4 when there are other tools - just checking you aren't
wearing that type of 'blinders'.
(didn't think so, but...)
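
For completeness, the 'reconstruct from the parse tree' route is only a
few lines - 'fix_url' standing-in for whichever rewrite-rules apply:

    from bs4 import BeautifulSoup

    def rewrite_file(path, fix_url):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        for anchor in soup.find_all("a", href=True):
            anchor["href"] = fix_url(anchor["href"])
        with open(path, "w", encoding="utf-8") as f:
            f.write(str(soup))   # NB: BS4 canonicalises the markup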


>> What about selective use of tools, previously-mentioned in this thread?
> 
> I've answered the option of regular expressions; did I miss any other
> HTML-aware tools being mentioned? If so, my apologies, and feel free
> to remind me.
> 
>> Is Selenium worthy of consideration?
> 
> Yes..... but I don't know how much it would buy me. It certainly has
> no options for editing back the original HTML, so all it would do is
> the parsing side of things (which is already working fine).

In which case, no gain.
(I probably use it more than BS - but only because it is useful to
'test' web-pages, GUI behavior, etc)

My thinking was to start with [a parser] and deal with a sub-set of the
list of 'problems', one (separate) routine at a time. Either of the
tools mentioned will do the job.

A better 'diff' would not look at the HTML, but compare the web-page's
before-and-after 'appearances'!
(after all, the link-checker has already figured out the 'behind the
scenes' part)
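
A sketch of such a check: compare what the reader would actually see -
the rendered text, with HTML's whitespace-collapsing applied - rather
than the raw markup:

    from bs4 import BeautifulSoup

    def same_appearance(before_html, after_html):
        """True if two HTML strings render the same visible text."""
        def visible(html):
            text = BeautifulSoup(html, "html.parser").get_text()
            return " ".join(text.split())  # collapse whitespace runs
        return visible(before_html) == visible(after_html)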


>> I'm assuming you've already been using a link-checker utility to locate
>> the links which need to be changed. They can be used in QA-mode
>> after-the-fact too.
> 
> I actually haven't, but only because I figured that the autofixer
> would do the same job as the link-checker. Or rather, I wrote my own
> link-checker because I needed it to do more. And again, most standard
> utilities merely list the problems, they don't have a way to fix them.

Discussed throughout this reply.
(most such utilities were written a long, long time ago - pre-dating the
widespread introduction of (and now insistence upon?) https, JS
advantages, etc)


>>> For the record, I have very long-term plans to migrate parts of the
>>> site to Markdown, which would make a lot of things easier. But for
>>> now, I need to fix the existing problems in the existing HTML files,
>>> without doing gigantic wholesale layout changes.
>>
>> ...and there's another option. If the Markdown conversion is done first,
>> it will obviate any option of diffs completely. However, it will
>> introduce a veritable cornucopia of opportunity for this and 'other
>> stuff' to go-wrong, bringing us back to a page-by-page check or
>> broad-checks only, and an appeal to readers to report problems.
> 
> Yeah, and the fundamental problem with the MD conversion is time -
> it's a big manual process. I want to be able to do that progressively
> over time, but get the basic stuff sorted out much sooner. Ideally, it
> should be possible to fix all the autofixable links this week and get
> that sorted out, but converting pages to Markdown will happen slowly
> over the next few years.

Not something I've ever needed to consider. Are there no tools for this?


>> The (PM-oriented) observation is that if you are baulking at the amount
>> of work 'now', you'll be equally dismayed by the consequences of a
>> subsequent 'Markdown project'!
> 
> Nah, there's no rush on it, and I know from experience how much
> benefit it can give :)

Warning: more 'normal people' know something of HTML than of Markdown.
In fact, whereas many people from outside of IT attend our HTML5 courses
(implicit disclaimer!), I'll suggest that if a person knows Markdown,
(s)he is at least 85% likely to be in IT.


Another ([in]famous dn off-the-wall) question: have you considered
'crowd-sourcing' the project? There are bound to be members who
particularly favor one operetta or song. If the project were moved to a
wiki or software like WordPress, would individual members be prepared to
copy-paste from 'the old' to 'the new', checking links and copy-pasting
the URL, etc, as they go?

You might then become PM with responsibility for listing all the
existing pages and crossing them off as 'converted'. Also, dealing with
'complicated matters', eg links which the member is unable to replace.

NB I recall (somewhere) a claim (about "distraction" and "consumption"
cf "creativity") that if all Americans were to stop watching
("consuming") TV for a single weekend, it would release sufficient time
to "create" Wikipedia in its entirety.


>> Perhaps, therefore, some counter-intuitive logic, eg combining the
>> two/biting two bullets/recognising that many of the risks and
>> likelihoods of error overlap (rather than add/multiply).
> 
> That's true, and for new pages, it's way easier to handle (for
> instance, this page https://gsarchive.net/html/dixon.html did not
> exist prior to my curatorship - for obvious reasons - and I created it
> as a Markdown file).

Well, I'm not just a pretty face - and most of the time, not even that!


>> 'Bit rot' is so common in today's world, do readers treat such
>> pages/sites particularly differently?
> 
> That's what I am unsure of, and why I would prefer to make as few
> unnecessary changes as possible. However, I am leaning more and more
> strongly towards "just let BS4 do its canonicalization", given that
> all the alternatives posted here have been worse.

Good!

Remember, the more important measure is what the users/members think -
not how IT-perfect it might be.

This is the opposite of the mantra I (over-)frequently recite to
trainees (particularly those who have just learned JS and think they can
now 'take over the world'...): "just because we can do it, doesn't make
it a good idea"!

I feel, whilst a modest contribution, almost worthy of G&S!


>> Somewhat conversely, even in our 'release-often, break-early' world, do
>> users often exert themselves to provide constructive feedback, eg 'link
>> broken'?
> 
> Maybe? But there are always pages that only a few people ever look at
> (this is a vast archive and some of its content is *extremely* niche),
> so I would prefer to preempt the issues.
> 
> Appreciate the thoughts.

I'm conscious that some of the thoughts have come from having a
Consulting/PM 'hat' on; whereas others address more technical aspects.

Similarly, is it possible that you are attempting to be "the very model
of a modern Major-General", whilst the service required may be more
along the lines of being "very good at [la, la, la; something or other -
he can't remember the words] and calculus" (IT not having been invented
back in the days of the Pirates of Penzance)?

Looking at the site (unkindly-speaking), it is reminiscent of the
AOL/GeoCities days. Which is not necessarily 'bad', but is indicative of
a membership who care less about appearances and more about 'the
business' of the group.

Accordingly, and referring back to the responses people have to dead
links and such-like, isn't it *more* likely that such members will
report 'issues'?

Thus, would inviting such communications, perhaps by putting a 'banner'
on every page, be likely to yield constructive results without
generating (unreasonable) complaint?

In turn, this might mean lowering your personal standards and
expectations on the 'calculus' side, and accepting that it may not be
perfect, AND assuming that the users are committed to the cause and
therefore more interested in being constructive and part of the
improvements project (than in carping from the side-lines) - rather than
trying to make the work look like that of someone who has given
fault-less service for decades and enjoyed many commensurate promotions,
Mr Major-General, sir?

In short (referring back to the 'list of options', above, top), there's
no need to "give up" (and I can't see you allowing yourself to do so
anyway), but perhaps grant yourself permission to accept a result
(slightly) less than 100%!
(and yes, maybe that is the "lazy" coming out, but there's also a sense
of YAGNI when investing significant effort into infrequently-viewed
web-pages/resources - maybe apply the 80:20 'rule'?)


Meantime, I'm off to find my boxed-set and play some CDs of 'silly
songs' while I work...
-- 
Regards,
=dn

