How to get an encoding a value?

Fri Oct 22 12:53:35 EDT 2004

Diez B. Roggisch wrote:

> A common approach to guessing the encoding of said string is to try
> something like this:
> 
> s = <some string with unknown encoding>
> encodings = ['ascii', 'latin1', 'utf-8', ....] # list of encodings you expect
> for e in encodings:
>     try:
>         if s == s.decode(e).encode(e):
>             break
>     except UnicodeError:
>         pass

However, you must be very careful with the order in which you test the
encodings. As written, the example code will never detect "utf-8", because
every byte string survives the "latin1" round trip and the loop stops there:

>>> s = "".join(map(chr, range(256)))
>>> s.decode("latin1").encode("latin1") == s
True
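
One way around this is to try the strict encodings first and the
"anything goes" ones last. The following is only a sketch of that idea, not
the code from the quoted post: it is written for Python 3 (bytes in, str
out) rather than the Python 2 used in the snippets here, and guess_encoding
and its default list are names I made up for illustration.

```python
def guess_encoding(data, encodings=("ascii", "utf-8", "latin1")):
    """Return the first encoding that round-trips `data`, or None.

    Hypothetical helper: the order of `encodings` is the whole point.
    Strict encodings (ascii, utf-8) reject many byte strings, so they go
    first; latin1 accepts every byte string, so it goes last.
    """
    for name in encodings:
        try:
            if data.decode(name).encode(name) == data:
                return name
        except UnicodeError:
            # data is not valid in this encoding; try the next one
            continue
    return None


utf8_bytes = "für".encode("utf-8")
latin1_bytes = "für".encode("latin1")

print(guess_encoding(utf8_bytes))
print(guess_encoding(latin1_bytes))
# With latin1 ahead of utf-8, the utf-8 input is misreported as latin1:
print(guess_encoding(utf8_bytes, ("ascii", "latin1", "utf-8")))
```

Even with a sensible order, a match on "latin1" only means "nothing stricter
fit", not that the text really is Latin-1.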

This equality holds for every encoding that maps each byte to exactly one
character and uses the full range of 256 byte values. You cannot
discriminate between such encodings using the above method:

>>> s.decode("latin1").encode("latin1") == s
True
>>> s.decode("latin2").encode("latin2") == s
True
>>> s.decode("latin2") == s.decode("latin1")
False

A statistical approach seems more promising, e.g. some smart variant of
"looking for umlauts" in a text known to be German.
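
A very crude version of "looking for umlauts" might look like the sketch
below. It is Python 3 again, and the names GERMAN_HINTS, umlaut_score and
guess_by_umlauts as well as the candidate list are my own assumptions. It
decodes the bytes with each candidate and keeps the one that yields the most
German special characters. It is a toy heuristic, not a real statistical
model.

```python
# Characters we expect to see in German text (an assumption of this sketch).
GERMAN_HINTS = set("äöüÄÖÜß")


def umlaut_score(text, expected=GERMAN_HINTS):
    """Count how many characters of `text` are in the expected set."""
    return sum(1 for ch in text if ch in expected)


def guess_by_umlauts(data, candidates=("utf-8", "latin1", "cp437")):
    """Return the candidate whose decoding contains the most umlauts.

    Ties go to the earlier candidate; returns None if nothing decodes.
    """
    best_name, best_score = None, -1
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeError:
            continue  # not even decodable, so not a candidate
        current = umlaut_score(text)
        if current > best_score:
            best_name, best_score = name, current
    return best_name


sample = "Größe über alles"
for enc in ("utf-8", "latin1", "cp437"):
    print(enc, "->", guess_by_umlauts(sample.encode(enc)))
```

Note that this still cannot tell "latin1" from "latin2": both put the German
umlauts and ß at the same byte values, so the score is a tie and the first
candidate wins.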

Peter