how to detect the character encoding in a web page ?

Nobody nobody at nowhere.com
Thu Jun 6 02:22:37 EDT 2013


On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:

> The HTTP header is completely out of band. This is the best way to
> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
> parsing. Once you find a meta tag, you stop parsing and go back to the
> top, decoding in the new way.

Provided that the meta tag indicates an ASCII-compatible encoding, and you
haven't encountered any decode errors due to 8-bit characters, then
there's no need to go back to the top.

> "ASCII-compatible" covers a huge number of
> encodings, so it's not actually much of a problem to do this.

With slight modifications, you can also handle some
almost-ASCII-compatible encodings such as shift-JIS.

Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
have actually been seen, and only re-start parsing from the top if the
encoding change actually affects the interpretation of any of those bytes.

And if the encoding isn't even remotely ASCII-compatible, you aren't going
to be able to recognise the meta tag in the first place. But I don't think
I've ever seen a web page encoded in UTF-16 or EBCDIC.

Tools like chardet are meant for the situation where either no encoding is
specified or the specified encoding can't be trusted (which is rather
common; why else would web browsers have a menu to allow the user to
select the encoding?).




More information about the Python-list mailing list