Detect character encoding
Diez B. Roggisch
deets at nospam.web.de
Sun Dec 4 17:36:54 EST 2005
Mike Meyer wrote:
> "Diez B. Roggisch" <deets at nospam.web.de> writes:
>
>>Michal wrote:
>>
>>>is there any way how to detect string encoding in Python?
>>>I need to proccess several files. Each of them could be encoded in
>>>different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>>and encode it to utf-8 (with string function encode).
>>
>>But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
>>file is "legal" in all encodings.
>
>
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
----- test.py
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-----
192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>
latin1
iso-8859-2
So cp1250 doesn't have all codepoints defined - but the others do.
Sure, this helps you eliminate one of the three choices the OP wanted
to choose between - but how many texts do you have that actually
contain a byte 0x81 in them?
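For what it's worth, the trial-and-error idea can be sketched like this
(written for today's Python 3, not the 2.4 above; the candidate list and
the helper name to_utf8 are my own choices, not a real detection
algorithm - it just picks the first encoding that decodes cleanly, which
is exactly why ambiguity remains):

```python
# Try a preference-ordered list of candidate encodings and re-encode
# the first clean decode as UTF-8. Note: latin1 and iso-8859-2 define
# all 256 byte values, so ordering the list is what does the "choosing".
CANDIDATES = ["utf-8", "cp1250", "iso-8859-2", "latin1"]

def to_utf8(data: bytes) -> bytes:
    for enc in CANDIDATES:
        try:
            return data.decode(enc).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # latin1 never raises, so this is only reachable if it is
    # removed from CANDIDATES
    raise ValueError("no candidate encoding fits")

# byte 0x81 is undefined in cp1250 (and invalid UTF-8 on its own),
# so it falls through to iso-8859-2
print(to_utf8(b"\x81"))
```

It "works", but as said above it only ever rules encodings out - it
cannot tell iso-8859-2 from latin1 on bytes both of them define.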
Regards,
Diez