Detect character encoding

Diez B. Roggisch deets at nospam.web.de
Sun Dec 4 17:36:54 EST 2005


Mike Meyer wrote:
> "Diez B. Roggisch" <deets at nospam.web.de> writes:
> 
>>Michal wrote:
>>
>>>is there any way to detect string encoding in Python?
>>>I need to process several files. Each of them could be encoded in a
>>>different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>>and encode it to utf-8 (with the string function encode).
>>
>>But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
>>file is "legal" in all encodings.
> 
> 
> Not quite. Some encodings don't define all 256 byte values, so if you
> encounter a byte that an encoding doesn't map, you can eliminate that
> encoding from the list of possibilities. This doesn't really help much
> by itself, though.
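
Both halves of that are easy to demonstrate. A byte that two encodings both
define decodes without complaint in either one; it just means different
characters, so a successful decode proves nothing. A minimal sketch (the byte
value is picked arbitrarily):

-----
# 0xbf decodes as U+00BF (inverted question mark) in latin1, but as
# U+017C (z with dot above) in iso-8859-2 - both decodes succeed,
# they just disagree about what the byte means.
for enc in ["latin1", "iso-8859-2"]:
    print enc, repr("\xbf".decode(enc))
-----

A byte that an encoding leaves undefined, on the other hand, at least gives
you an error: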


----- test.py
# try to decode all 256 possible byte values with each candidate encoding
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-----

192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>
latin1
iso-8859-2

So cp1250 doesn't have all codepoints defined - but the others do. Sure, this 
helps you eliminate one of the three choices the OP wanted to choose between - 
but how many texts do you have that actually contain a byte like 0x81?
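
If the goal is what the OP described - read files that may be in any of a
handful of candidate encodings and write them out as UTF-8 - then about the
best this approach gives you is: try the candidates in order and take the
first one that decodes cleanly. A rough sketch (the function name and the
candidate list are just placeholders, and the order is the real guesswork):

-----
# Try each candidate encoding in turn; the first clean decode wins.
CANDIDATES = ["cp1250", "iso-8859-2", "latin1"]

def to_utf8(data, encodings=CANDIDATES):
    for enc in encodings:
        try:
            return data.decode(enc).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")
-----

And as the output above shows, latin1 and iso-8859-2 both accept every single
byte, so whichever of the two comes first in the list always wins - which is
exactly the problem: decode errors alone can't tell them apart.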

Regards,

Diez


