Mutating an HTML file with BeautifulSoup

Peter J. Holzer hjp-python at hjp.at
Mon Aug 22 14:34:13 EDT 2022


On 2022-08-22 00:45:56 -0000, Jon Ribbens via Python-list wrote:
> With the offset though, BeautifulSoup made an arbitrary decision to
> use ISO-8859-1 encoding and so when you chopped the bytestring at
> that offset it only worked because BeautifulSoup had happened to
> choose a 1-byte-per-character encoding. Ironically, *without* the
> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

Actually it would. The unit is bytes if you feed it bytes, and
characters if you feed it str. So in either case you can use the
offset on the data you fed to the parser. That may not be what you
expected, but it seems quite useful for what Chris has in mind.
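
To make the bytes-versus-characters point concrete, here is a minimal
sketch in plain Python. It does not call BeautifulSoup at all. It only
shows that the same position in a document has a different offset
depending on whether you count characters or UTF-8 bytes, and that an
offset only works on the kind of data it was measured on. The sample
markup is made up for illustration.

```python
# Hypothetical sample document; the "ü" takes two bytes in UTF-8.
text = "<p>für</p><b>x</b>"
raw = text.encode("utf-8")

# The same tag, located once in the str and once in the bytes.
char_offset = text.index("<b>")
byte_offset = raw.index(b"<b>")
print(char_offset, byte_offset)

# Each offset slices correctly on the data it was computed from ...
assert text[char_offset:] == "<b>x</b>"
assert raw[byte_offset:] == b"<b>x</b>"

# ... but a character offset applied to the bytes lands one byte early.
assert raw[char_offset:] == b"><b>x</b>"
```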

(OTOH it seems that the html parser doesn't heed any <meta charset>
tags, which seems less than ideal for more pedestrian purposes.)

> > So I would probably just let this one go through as 8859-1.
> 
> It looks like BeautifulSoup is doing something like that, yes.
> Personally I would be nervous about some of my files being parsed
> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> than some of the files actually *being* ISO-8859-1 ;-) )

Since none of the syntactically meaningful characters have a code >=
0x80, you can parse HTML at the byte level if you know that it's encoded
in a strict superset of ASCII (which all of the ISO-8859 family and
UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16,
or Shift-JIS or EUC, if I remember correctly) do you have to know
the character set.
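
A small sketch of why that works, assuming nothing beyond the standard
library: scanning for the bytes 0x3C ("<") and 0x3E (">") finds exactly
the markup delimiters in ISO-8859-1 and UTF-8 data, because neither
encoding uses bytes below 0x80 for anything except ASCII. In UTF-16 the
same scan can give false hits. The helper name markup_positions and the
sample strings are invented for this illustration.

```python
def markup_positions(data: bytes) -> list[int]:
    """Return the byte offsets of every 0x3C ('<') and 0x3E ('>') byte."""
    return [i for i, byte in enumerate(data) if byte in b"<>"]

text = "<p>Grüße</p>"

# In both ASCII supersets, every hit is a real '<' or '>'.
latin1_hits = markup_positions(text.encode("iso-8859-1"))
utf8_hits = markup_positions(text.encode("utf-8"))

# U+3C3C is a CJK ideograph, not markup, but in UTF-16 it is
# encoded as the two bytes 0x3C 0x3C, so the byte scan is fooled.
utf16_hits = markup_positions("\u3c3c".encode("utf-16-le"))

print(latin1_hits, utf8_hits, utf16_hits)
```

The offsets differ between the two ASCII-compatible encodings (the
non-ASCII letters take two bytes each in UTF-8), but in both cases the
hits line up with the actual tag delimiters.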

(By parsing I mean only "create a syntax tree". Obviously you have to
know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)
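
The display ambiguity is easy to reproduce; this sketch decodes the
same two bytes under both encodings:

```python
raw = bytes.fromhex("c3bc")

as_utf8 = raw.decode("utf-8")         # one character: ü
as_latin1 = raw.decode("iso-8859-1")  # two characters: Ã¼

print(as_utf8, as_latin1)
```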

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"