Detect character encoding
"Martin v. Löwis"
martin at v.loewis.de
Mon Dec 5 01:58:24 EST 2005
Martin P. Hellwig wrote:
> From what I can remember, they used an algorithm to create some
> statistics for the specific page, compared those with statistics about
> all kinds of languages and encodings, and mapped the most likely one.
More hearsay: I believe language-based heuristics are common. You first
guess an encoding based on the bytes you see, then guess a language of
the page. If you then get a lot of characters that should not appear
in texts of the language (e.g. a lot of umlaut characters in a French
page), you know your guess was wrong, and you try a different language
for that encoding. If you run out of languages, you guess a different
encoding.
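That two-stage heuristic might be sketched as follows. This is only an
illustration of the idea, not Mozilla's actual detector: the candidate
encodings, the per-language "unexpected character" sets, and the
threshold are all made-up assumptions.

```python
# Illustrative sketch of the guess-then-validate heuristic.
# The tables below are assumptions for demonstration, not real data.

CANDIDATE_ENCODINGS = ["utf-8", "iso-8859-1", "cp1252"]

# Characters that would be suspicious in text of a given language,
# e.g. many umlaut characters in a French page suggest a wrong guess.
UNEXPECTED = {
    "fr": set("äöüßÄÖÜ"),
    "de": set(),  # umlauts are perfectly normal in German
}

def plausible(text, language, threshold=0.01):
    """True if the text contains few characters that should not
    appear in texts of the given language."""
    if not text:
        return False
    bad = sum(1 for ch in text if ch in UNEXPECTED.get(language, set()))
    return bad / len(text) < threshold

def guess(data, languages=("fr", "de")):
    """Try each encoding, then each language; return the first
    (encoding, language) pair whose decoded text looks plausible."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # bytes are invalid for this encoding: try the next
        for language in languages:
            if plausible(text, language):
                return encoding, language
        # ran out of languages: fall through and try another encoding
    return None, None
```

Running `guess("Über schön grün".encode("cp1252"))` would reject the
French hypothesis (too many umlauts) and settle on a German reading,
which is the back-and-forth described above.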
Mozilla can guess the encoding if you tell it what the language is,
which sounds like a similar approach.
Regards,
Martin
More information about the Python-list mailing list