Mutating an HTML file with BeautifulSoup

Sun Aug 21 17:19:33 EDT 2022

On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
<python-list at python.org> wrote:
>
> On 2022-08-21, Chris Angelico <rosuav at gmail.com> wrote:
> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> ><python-list at python.org> wrote:
> >> On 2022-08-20, Chris Angelico <rosuav at gmail.com> wrote:
> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> >> >> 2QdxY4RzWzUUiLuE at potatochowder.com writes:
> >> >> >textual representations.  That way, the following two elements are the
> >> >> >same (and similar with a collection of sub-elements in a different order
> >> >> >in another document):
> >> >>
> >> >>   The /elements/ differ. They have the /same/ infoset.
> >> >
> >> > That's the bit that's hard to prove.
> >> >
> >> >>   The OP could edit the files with regexps to create a new version.
> >> >
> >> > To you and Jon, who also suggested this: how would that be beneficial?
> >> > With Beautiful Soup, I have the line number and position within the
> >> > line where the tag starts; what does a regex give me that I don't have
> >> > that way?
> >>
> >> You mean you could use BeautifulSoup to read the file and identify the
> >> bits you want to change by line number and offset, and then you could
> >> use that data to try and update the file, hoping like hell that your
> >> definition of "line" and "offset" are identical to BeautifulSoup's
> >> and that you don't mess up later changes when you do earlier ones (you
> >> could do them in reverse order of line and offset I suppose) and
> >> probably resorting to regexps anyway in order to find the part of the
> >> tag you want to change ...
> >>
> >> ... or you could avoid all that faff and just do re.sub()?
> >
> > Stefan answered in part, but I'll add that it is far FAR easier to do
> > the analysis with BS4 than regular expressions. I'm not sure what
> > "hoping like hell" is supposed to mean here, since the line and offset
> > have been 100% accurate in my experience;
>
> Given the string:
>
>     b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>
> what is the line number and offset of the question mark - and does
> BeautifulSoup agree with your answer? Does the answer to that second
> question change depending on what parser you tell BeautifulSoup to use?

I'm not sure, because I don't know how to ask BS4 about the location
of a question mark. But I replaced that with a tag, and:

>>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>"
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> soup.body.sourceline
4
>>> soup.body.sourcepos
12
>>> raw.split(b"\n")[3]
b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>'
>>> raw.split(b"\n")[3][12:]
b'<body></body>'

So, yes, it seems to be correct. (Slightly odd in that the sourceline
is 1-based but the sourcepos is 0-based, but that is indeed the case,
as confirmed with a much more straightforward string.)

And yes, it depends on the parser, but I'm using html.parser and it's fine.
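
To use that pair outside BS4, it has to become a single index into the
text. A minimal sketch, assuming the convention observed above (only
"\n" ends a line, sourceline is 1-based, sourcepos is 0-based) and
assuming the bytes are decoded as ISO-8859-1 so that one byte is one
character. absolute_offset is a hypothetical helper, not part of bs4:

```python
def absolute_offset(text, sourceline, sourcepos):
    """Convert a 1-based line and 0-based column into an index into text."""
    start = 0
    # Skip sourceline - 1 newlines; only "\n" ends a line here, so a bare
    # "\r" or "\v" counts as an ordinary character within the line.
    for _ in range(sourceline - 1):
        start = text.index("\n", start) + 1
    return start + sourcepos


raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>"
# Assumption: ISO-8859-1, so every byte maps to exactly one character.
text = raw.decode("iso-8859-1")
offset = absolute_offset(text, 4, 12)  # the values reported above
print(text[offset:])  # prints <body></body>
```

If the file is read in text mode rather than as bytes, opening it with
newline="" keeps Python from translating "\r" and "\r\n", which would
otherwise shift the positions.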

> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
> I am happy with the program throwing an exception" then feel free to
> remove that substring from the question.)

Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
8859-1. So I would probably just let this one go through as 8859-1.
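
That fallback could be sketched as below: try strict UTF-8 first, then
ISO-8859-1, which accepts any byte sequence. decode_html is a
hypothetical name, and treating every non-UTF-8 file as 8859-1 is the
assumption stated above rather than anything detected from the file:

```python
def decode_html(raw):
    """Return (text, encoding) for a file that is UTF-8 or, failing that,
    assumed to be ISO-8859-1."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps each byte to one character, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


text, encoding = decode_html(b"\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8")
print(encoding)
```

Returning the encoding alongside the text means the edited file can be
written back with text.encode(encoding), so untouched bytes stay
identical in the diff.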

> > the only part I'm unsure about is where the _end_ of the tag is (and
> > maybe there's a way I can use BS4 again to get that??).
>
> There doesn't seem to be. More to the point, there doesn't seem to be
> a way to find out where the *attributes* are, so as I said you'll most
> likely end up using regexps anyway.

I'm okay with replacing an entire tag that needs to be changed.
Especially if I can replace just the opening tag, not the contents and
closing tag. And in fact, I may just do that part by scanning for an
unencoded greater-than, on the assumptions that (a) BS4 will correctly
encode any greater-thans in attributes, and (b) if there's a
mis-encoded one in the input, the diff will be small enough to
eyeball, and a human should easily notice that the text has been
massively expanded and duplicated.
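
A rough sketch of that scan, under assumption (a) that the first ">"
after the tag's "<" closes the opening tag. replace_opening_tags is a
hypothetical helper; in practice the offsets would be derived from
sourceline and sourcepos rather than from str.index as they are here.
Edits are applied from the end of the document backwards, as Jon
suggested, so earlier offsets stay valid:

```python
def replace_opening_tags(text, edits):
    """Apply (start, new_tag) edits, where start indexes a tag's "<"."""
    # Work backwards so that earlier offsets remain valid after later
    # parts of the text change length.
    for start, new_tag in sorted(edits, reverse=True):
        if text[start] != "<":
            raise ValueError(f"no tag starts at offset {start}")
        # Assumption (a): the first ">" after the "<" ends the opening tag.
        end = text.index(">", start) + 1
        text = text[:start] + new_tag + text[end:]
    return text


html = '<p><a href="http://old.example/">one</a> <a href="x.html">two</a></p>'
first = html.index("<a")
second = html.index("<a", first + 1)
edits = [(first, '<a href="https://new.example/">'), (second, '<a href="y.html">')]
result = replace_opening_tags(html, edits)
print(result)
```

A stray ">" inside an attribute value would cut the match short and
leave the rest of the old tag behind, which is case (b): it shows up
in the diff as leftover text.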

ChrisA