Mutating an HTML file with BeautifulSoup

Jon Ribbens jon+usenet at unequivocal.eu
Sun Aug 21 20:45:56 EDT 2022


On 2022-08-21, Chris Angelico <rosuav at gmail.com> wrote:
> On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
><python-list at python.org> wrote:
>> On 2022-08-21, Chris Angelico <rosuav at gmail.com> wrote:
>> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
>> ><python-list at python.org> wrote:
>> >> On 2022-08-20, Chris Angelico <rosuav at gmail.com> wrote:
>> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>> >> >> 2QdxY4RzWzUUiLuE at potatochowder.com writes:
>> >> >> >textual representations.  That way, the following two elements are the
>> >> >> >same (and similar with a collection of sub-elements in a different order
>> >> >> >in another document):
>> >> >>
>> >> >>   The /elements/ differ. They have the /same/ infoset.
>> >> >
>> >> > That's the bit that's hard to prove.
>> >> >
>> >> >>   The OP could edit the files with regexps to create a new version.
>> >> >
>> >> > To you and Jon, who also suggested this: how would that be beneficial?
>> >> > With Beautiful Soup, I have the line number and position within the
>> >> > line where the tag starts; what does a regex give me that I don't have
>> >> > that way?
>> >>
>> >> You mean you could use BeautifulSoup to read the file and identify the
>> >> bits you want to change by line number and offset, and then you could
>> >> use that data to try and update the file, hoping like hell that your
>> >> definition of "line" and "offset" are identical to BeautifulSoup's
>> >> and that you don't mess up later changes when you do earlier ones (you
>> >> could do them in reverse order of line and offset I suppose) and
>> >> probably resorting to regexps anyway in order to find the part of the
>> >> tag you want to change ...
>> >>
>> >> ... or you could avoid all that faff and just do re.sub()?
>> >
>> > Stefan answered in part, but I'll add that it is far FAR easier to do
>> > the analysis with BS4 than regular expressions. I'm not sure what
>> > "hoping like hell" is supposed to mean here, since the line and offset
>> > have been 100% accurate in my experience;
>>
>> Given the string:
>>
>>     b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>>
>> what is the line number and offset of the question mark - and does
>> BeautifulSoup agree with your answer? Does the answer to that second
>> question change depending on what parser you tell BeautifulSoup to use?
>
> I'm not sure, because I don't know how to ask BS4 about the location
> of a question mark. But I replaced that with a tag, and:
>
>>>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>"
>>>> from bs4 import BeautifulSoup
>>>> soup = BeautifulSoup(raw, "html.parser")
>>>> soup.body.sourceline
> 4
>>>> soup.body.sourcepos
> 12
>>>> raw.split(b"\n")[3]
> b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body></body>'
>>>> raw.split(b"\n")[3][12:]
> b'<body></body>'
>
> So, yes, it seems to be correct. (Slightly odd in that the sourceline
> is 1-based but the sourcepos is 0-based, but that is indeed the case,
> as confirmed with a much more straight-forward string.)
>
> And yes, it depends on the parser, but I'm using html.parser and it's fine.

Hah, yes, it appears html.parser does an end-run around my lovely
carefully crafted hard case by not even *trying* to work out what
type of line endings the file uses: it is simply hard-coded to
recognise only "\n" as a line ending.
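
A quick illustration of that, using the same html.parser backend as
your example (I'd expect the same on any recent bs4, but I haven't
tried the other parsers):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(b"a\rb\r<body></body>", "html.parser")
>>> soup.body.sourceline   # still 1: a bare "\r" isn't counted as a new line
1
>>> soup.body.sourcepos    # the offset just keeps accumulating past the "\r"s
4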

With the offset, though, BeautifulSoup made an arbitrary decision to
use the ISO-8859-1 encoding, so when you chopped the bytestring at
that offset it only worked because BeautifulSoup happened to choose
a 1-byte-per-character encoding. Ironically, *without* the
"\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

>> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
>> I am happy with the program throwing an exception" then feel free to
>> remove that substring from the question.)
>
> Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
> be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
> 8859-1. So I would probably just let this one go through as 8859-1.

It looks like BeautifulSoup is doing something like that, yes.
Personally I would be nervous about some of my files being parsed
as UTF-8 and some of them as ISO-8859-1 (due to decoding errors
rather than some of the files actually *being* ISO-8859-1 ;-) )
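
For what it's worth, BeautifulSoup will at least tell you what it
decided, via soup.original_encoding, so a script could refuse to
carry on when the guess isn't what was expected. A rough, untested
sketch (the file name and the list of acceptable encodings are just
placeholders):

    from bs4 import BeautifulSoup

    with open("page.html", "rb") as f:
        soup = BeautifulSoup(f, "html.parser")
    if soup.original_encoding not in ("utf-8", "ascii"):
        raise SystemExit("unexpected encoding: %r" % soup.original_encoding)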

>> > the only part I'm unsure about is where the _end_ of the tag is (and
>> > maybe there's a way I can use BS4 again to get that??).
>>
>> There doesn't seem to be. More to the point, there doesn't seem to be
>> a way to find out where the *attributes* are, so as I said you'll most
>> likely end up using regexps anyway.
>
> I'm okay with replacing an entire tag that needs to be changed.

Oh, that seems like quite a big change to the original problem.

> Especially if I can replace just the opening tag, not the contents and
> closing tag. And in fact, I may just do that part by scanning for an
> unencoded greater-than, on the assumptions that (a) BS4 will correctly
> encode any greater-thans in attributes,

But your input wasn't created by BeautifulSoup (was it?)

> and (b) if there's a mis-encoded one in the input, the diff will be
> small enough to eyeball, and a human should easily notice that the
> text has been massively expanded and duplicated.

I strongly endorse Stefan Ram's excellent suggestion that, regardless
of how you *make* the change, you use BeautifulSoup afterwards to do a
pretty strong check that the changes effected are (a) all the ones you
intended and (b) only the ones you intended.
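
Something along these lines, say (the file names are placeholders,
and "which changes were intended" is whatever edit you actually made;
untested):

    from bs4 import BeautifulSoup

    with open("before.html", "rb") as f:
        before = BeautifulSoup(f, "html.parser")
    with open("after.html", "rb") as f:
        after = BeautifulSoup(f, "html.parser")

    old_tags = before.find_all(True)   # every tag, in document order
    new_tags = after.find_all(True)
    assert len(old_tags) == len(new_tags), "tags added or lost"
    for old, new in zip(old_tags, new_tags):
        if old.name != new.name or old.attrs != new.attrs:
            # anything printed here should be a change you meant to make
            print(old.sourceline, old.name, old.attrs, "->", new.attrs)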

