Mutating an HTML file with BeautifulSoup

Mon Aug 22 15:27:28 EDT 2022

On 2022-08-22, Peter J. Holzer <hjp-python at hjp.at> wrote:
> On 2022-08-22 00:45:56 -0000, Jon Ribbens via Python-list wrote:
>> With the offset though, BeautifulSoup made an arbitrary decision to
>> use ISO-8859-1 encoding and so when you chopped the bytestring at
>> that offset it only worked because BeautifulSoup had happened to
>> choose a 1-byte-per-character encoding. Ironically, *without* the
>> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
>
> Actually it would. The unit is bytes if you feed it with bytes, and
> characters if you feed it with str.

No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
it first chooses an encoding and decodes the bytes before sending that
output to html.parser, which is what provides the offset. So the offsets
it gives are in characters, and you've no simple way of converting that
back to byte offsets.
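
A small stdlib-only sketch of the difference, for illustration. It does not
use BeautifulSoup at all: it decodes the bytes by hand in two encodings and
asks html.parser where the <b> tag starts. The class and helper names
(StartTagOffsets, start_tag_offsets) are made up for this example.

```python
from html.parser import HTMLParser


class StartTagOffsets(HTMLParser):
    """Record the column that html.parser reports for each start tag."""

    def __init__(self):
        super().__init__()
        self.offsets = {}

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line, column); the column counts str characters.
        self.offsets[tag] = self.getpos()[1]


def start_tag_offsets(text):
    """Parse a str and return {tag: column of its start tag}."""
    parser = StartTagOffsets()
    parser.feed(text)
    parser.close()
    return parser.offsets


raw = "<p>café <b>x</b></p>".encode("utf-8")

# Where the tag really starts in the byte string.
byte_offset = raw.find(b"<b>")

# Offset reported after decoding as UTF-8: "é" is one character, two bytes.
utf8_offset = start_tag_offsets(raw.decode("utf-8"))["b"]

# Offset reported after decoding as ISO-8859-1: one character per byte.
latin1_offset = start_tag_offsets(raw.decode("iso-8859-1"))["b"]

print(byte_offset, utf8_offset, latin1_offset)  # 9 8 9
```

Only the ISO-8859-1 decode gives an offset that can be used to slice the
original bytes, which is why chopping the bytestring happened to work.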

> (OTOH it seems that the html parser doesn't heed any <meta charset>
> tags, which seems less than ideal for more pedestrian purposes.)

html.parser doesn't accept bytes as input, so it couldn't do anything
with the encoding even if it knew it. BeautifulSoup's 'html.parser'
however does look for and use <meta charset> (using a regexp, natch).
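
Roughly what that amounts to, as a sketch only. The regexp and the
sniff_charset helper below are simplified stand-ins I wrote for this
example, not BeautifulSoup's actual code, and the 1024-byte prefix and the
UTF-8 default are arbitrary assumptions.

```python
import re
from html.parser import HTMLParser

# Simplified pattern: finds charset=... inside a <meta> tag in raw bytes.
CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_charset(raw, default="utf-8"):
    """Guess the encoding from a <meta charset> near the start of the bytes."""
    match = CHARSET_RE.search(raw[:1024])
    return match.group(1).decode("ascii") if match else default


raw = '<meta charset="iso-8859-1"><p>für</p>'.encode("iso-8859-1")

# html.parser itself only takes str; feeding it bytes fails.
try:
    HTMLParser().feed(raw)
    bytes_accepted = True
except TypeError:
    bytes_accepted = False

# So the caller has to pick an encoding and decode first.
encoding = sniff_charset(raw)
text = raw.decode(encoding)

print(bytes_accepted, encoding, text)
```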

>> It looks like BeautifulSoup is doing something like that, yes.
>> Personally I would be nervous about some of my files being parsed
>> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
>> than some of the files actually *being* ISO-8859-1 ;-) )
>
> Since none of the syntactically meaningful characters have a code >=
> 0x80, you can parse HTML at the byte level if you know that it's encoded
> in a strict superset of ASCII (which all of the ISO-8859 family and
> UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16
> (or Shift-JIS or EUC, if I remember correctly)) then you have to know
> the character set.
>
> (By parsing I mean only "create a syntax tree". Obviously you have to
> know the encoding to know whether to display «c3 bc» as «ü» or
> «Ã¼».)

But the job here isn't to create a syntax tree. It's to change some of
the content, which for all we know is not ASCII.
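
To make both points concrete, here is a deliberately naive sketch. The
replace_text helper is hypothetical and works on raw bytes purely for
illustration. Finding the markup works without knowing which ASCII superset
is in use, but changing non-ASCII content does not.

```python
def replace_text(raw, old, new, encoding):
    """Naive byte-level replace; only correct if `encoding` is the real one."""
    return raw.replace(old.encode(encoding), new.encode(encoding))


source = "<p>Grüße</p>"
utf8_doc = source.encode("utf-8")
latin1_doc = source.encode("iso-8859-1")
utf16_doc = source.encode("utf-16-le")

# Markup bytes are identical in any ASCII superset...
tags_found = [b"<p>" in doc for doc in (utf8_doc, latin1_doc)]  # [True, True]
# ...but not in UTF-16, where every character takes at least two bytes.
utf16_tag_found = b"<p>" in utf16_doc  # False

# Changing content needs the right encoding. With the right one it works:
fixed = replace_text(utf8_doc, "ü", "ue", "utf-8")

# With the wrong one the bytes for "ü" are never found, so nothing changes.
unchanged = replace_text(latin1_doc, "ü", "ue", "utf-8")

print(tags_found, utf16_tag_found, fixed == utf8_doc, unchanged == latin1_doc)
```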