Detect character encoding

Martin P. Hellwig mhellwig at xs4all.nl
Sun Dec 4 17:12:44 EST 2005


Mike Meyer wrote:
> "Diez B. Roggisch" <deets at nospam.web.de> writes:
>> Michal wrote:
>>> is there any way how to detect string encoding in Python?
>>> I need to proccess several files. Each of them could be encoded in
>>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>> and encode it to utf-8 (with string function encode).
>> But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
>> file is "legal" in all encodings.
> 
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
> 
>         <mike

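Mike's elimination idea can be sketched roughly as follows. This is only a
minimal illustration, assuming Python 3 and raw bytes as input; the name
possible_encodings and the candidate list are made up for the example and
do not come from any library.

```python
# Hypothetical candidate list; extend it with whatever charsets you expect.
CANDIDATES = ("ascii", "utf-8", "cp1250", "iso-8859-2")


def possible_encodings(data, candidates=CANDIDATES):
    """Return the candidate encodings under which `data` decodes cleanly.

    An encoding is eliminated as soon as the bytes contain a value (or a
    sequence) that is not valid in it.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors
```

Note that Python's iso-8859-2 codec accepts every byte value, so it is never
eliminated, which illustrates why this check does not help much by itself.
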
I read or heard (I can't remember where) that MS IE has a quite good
implementation for guessing the language and character encoding of web
pages when these are not specified or are specified incorrectly.
From what I can remember, they used an algorithm to gather some
statistics about the specific page, compared them with statistics about
all kinds of languages and encodings, and picked the most likely match.

Please be aware that I don't know if the above has even the slightest
amount of truth in it; however, that didn't prevent me from posting anyway ;-)
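
For what it's worth, a very crude version of such a statistical guess might
look like the sketch below. It is certainly not what IE does; it assumes
Python 3, assumes the text is Czech, and uses a made-up "profile" (the set
EXPECTED) instead of real language statistics. The names score and
guess_encoding are hypothetical.

```python
import string

# Hypothetical profile: characters we expect to see in Czech text.
EXPECTED = set(
    string.ascii_letters + string.digits + string.punctuation + " \n\t"
    + "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"
)


def score(text):
    """Fraction of characters in `text` that belong to the profile."""
    if not text:
        return 0.0
    return sum(ch in EXPECTED for ch in text) / len(text)


def guess_encoding(data, candidates=("utf-8", "cp1250", "iso-8859-2")):
    """Return the candidate whose decoding looks most like the profile.

    Candidates that cannot decode `data` at all are skipped; on a tie the
    earlier candidate wins. Returns None if no candidate can decode it.
    """
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        current = score(text)
        if current > best_score:
            best_name, best_score = name, current
    return best_name
```

Once a guess is made, data.decode(guess).encode("utf-8") gives the UTF-8
version Michal asked for. A real detector would compare letter or
letter-pair frequencies for many languages rather than a single set.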

-- 
mph


