Mutating an HTML file with BeautifulSoup

Peter J. Holzer hjp-python at hjp.at
Tue Aug 23 16:46:35 EDT 2022


On 2022-08-22 19:27:28 -0000, Jon Ribbens via Python-list wrote:
> On 2022-08-22, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > On 2022-08-22 00:45:56 -0000, Jon Ribbens via Python-list wrote:
> >> With the offset though, BeautifulSoup made an arbitrary decision to
> >> use ISO-8859-1 encoding and so when you chopped the bytestring at
> >> that offset it only worked because BeautifulSoup had happened to
> >> choose a 1-byte-per-character encoding. Ironically, *without* the
> >> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
> >
> > Actually it would. The unit is bytes if you feed it with bytes, and
> > characters if you feed it with str.
> 
> No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
> it first chooses an encoding and decodes the bytes before sending that
> output to html.parser, which is what provides the offset. So the offsets
> it gives are in characters, and you've no simple way of converting that
> back to byte offsets.

Ah, I see. It "worked" for me because "\xed\xa0\x80\xed\xbc\x9f" isn't
valid UTF-8. So BeautifulSoup decided to ignore the "<meta
charset='utf-8'>" I had inserted before and used ISO-8859-1, providing
me with correct byte offsets. If I replace that gibberish with a correct
UTF-8 sequence (e.g. "\x4B\xC3\xA4\x73\x65") the UTF-8 is decoded and I
get a character offset.
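
A minimal sketch of that behaviour (the surrounding HTML and the <a>
tag are made up here, not taken from my test files; it assumes
bs4 >= 4.8.1, where html.parser exposes sourceline/sourcepos, and the
fallback encoding chosen for the invalid input may vary depending on
which charset detectors are installed):

from bs4 import BeautifulSoup

# Invalid UTF-8 (a surrogate pair encoded byte-wise) vs. valid UTF-8 ("Käse").
invalid = b'<meta charset="utf-8"><p>\xed\xa0\x80\xed\xbc\x9f</p><a href="x">y</a>'
valid   = b'<meta charset="utf-8"><p>K\xc3\xa4se</p><a href="x">y</a>'

for data in (invalid, valid):
    soup = BeautifulSoup(data, 'html.parser')
    a = soup.find('a')
    # original_encoding is the codec BeautifulSoup actually decoded with;
    # sourceline/sourcepos locate the tag in the *decoded* text, so they
    # are byte positions only when that codec maps one byte to one character.
    print(soup.original_encoding, a.sourceline, a.sourcepos)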


> >> It looks like BeautifulSoup is doing something like that, yes.
> >> Personally I would be nervous about some of my files being parsed
> >> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> >> than some of the files actually *being* ISO-8859-1 ;-) )
> >
> > Since none of the syntactically meaningful characters have a code >=
> > 0x80, you can parse HTML at the byte level if you know that it's encoded
> > in a strict superset of ASCII (which all of the ISO-8859 family and
> > UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16
> > or Shift-JIS or EUC, if I remember correctly) then you have to know
> > the character set.
> >
> > (By parsing I mean only "create a syntax tree". Obviously you have to
> > know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)
> 
> But the job here isn't to create a syntax tree. It's to change some of
> the content, which for all we know is not ASCII.

We know it's URLs, and the canonical form of a URL is ASCII. The URLs
in the files may not be, but if they aren't you'll have to deal with
variants anyway. And the start and end of the attribute can be
determined in any strict superset of ASCII, including UTF-8.
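
A minimal sketch of what I mean by working at the byte level (the URL,
the file contents and the regex-based attribute matching are only
illustrative, not a robust way to parse HTML):

import re

data = b'<p>K\xc3\xa4se</p><a href="http://example.com/\xc3\xbc">link</a>'

# The quotes, '=', '<' and '>' are all ASCII, so the span of a
# double-quoted href value is well defined in the raw bytes for any
# ASCII-superset encoding (ISO-8859-*, UTF-8, ...).
m = re.search(rb'href="([^"]*)"', data)
start, end = m.span(1)                 # byte offsets of the attribute value
new_url = b'https://example.org/'      # canonical URLs are pure ASCII
patched = data[:start] + new_url + data[end:]
print(patched)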

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"