Detect character encoding

"Martin v. Löwis" martin at v.loewis.de
Mon Dec 5 01:58:24 EST 2005


Martin P. Hellwig wrote:
>  From what I can remember, they used an algorithm to create some 
> statistics for the specific page, compared those with statistics about 
> all kinds of languages and encodings, and just mapped it to the most likely one.

More hearsay: I believe language-based heuristics are common. You first
guess an encoding based on the bytes you see, then guess the language of
the page. If you then get a lot of characters that should not appear
in texts of the language (e.g. a lot of umlaut characters in a French
page), you know your guess was wrong, and you try a different language
for that encoding. If you run out of languages, you guess a different
encoding.
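
As a rough illustration, a minimal Python sketch of that loop might look
like the following; the candidate encodings, the per-language character
sets, and the 2% threshold are made-up values for the example, not data
from any real detector:

CANDIDATE_ENCODINGS = ["utf-8", "iso-8859-1", "cp1252"]

# Non-ASCII characters expected in each language; anything else counts
# as "should not appear" in texts of that language.
LANGUAGE_CHARS = {
    "french": set("àâçéèêëîïôùûÿœæ"),
    "german": set("äöüß"),
}

def guess(data, threshold=0.02):
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are impossible in this encoding; try the next one
        non_ascii = [c.lower() for c in text if ord(c) > 127]
        if not non_ascii:
            return encoding, None  # pure ASCII: nothing to distinguish languages
        for language, expected in LANGUAGE_CHARS.items():
            unexpected = sum(1 for c in non_ascii if c not in expected)
            # Too many characters that should not appear in texts of this
            # language means the (encoding, language) guess was wrong.
            if unexpected <= threshold * len(non_ascii):
                return encoding, language
        # Ran out of languages for this encoding: fall through and
        # guess a different encoding.
    return None, None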

Mozilla can guess the encoding if you tell it what the language is,
which sounds like a similar approach.
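
If you want to play with that kind of detector from Python, the chardet
package is a port of Mozilla's detection code. A quick sketch, assuming
chardet is installed and with a placeholder file name:

import chardet  # third-party port of Mozilla's charset detector

raw = open("page.html", "rb").read()  # placeholder input file
result = chardet.detect(raw)
# result is a dict with the guessed encoding and a confidence value,
# e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
print(result)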

Regards,
Martin


