Detect character encoding

"Martin v. Löwis" martin at v.loewis.de
Mon Dec 5 02:01:25 EST 2005


Diez B. Roggisch wrote:
> So cp1250 doesn't have all codepoints defined - but the others have. 
> Sure, this helps you to eliminate 1 of the three choices the OP wanted 
> to choose between - but how many texts you have that have a 129 in them?

For the iso8859 ones, you should assume that the characters in
range(128, 160), the C1 control codes, really aren't used. If you get
one of these, and the text is not utf-8, it is a Windows code page.
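
As a rough sketch of that rule (the function names here are made up for
illustration, and this is a heuristic, not a real detector), one could
first rule out UTF-8 with a strict decode and then look for bytes in
range(128, 160):

```python
def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte falls in range(128, 160)."""
    return any(0x80 <= b < 0xA0 for b in data)


def looks_like_windows_codepage(data: bytes) -> bool:
    """Hypothetical helper: apply the range(128, 160) rule to non-UTF-8 data."""
    try:
        # Strict decoding raises on any sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: bytes in range(128, 160) point away from iso8859.
        return has_c1_bytes(data)
    # Valid UTF-8, so the rule does not apply.
    return False
```

For example, b"\x93quoted\x94" (curly quotes in cp1252) is flagged,
while b"caf\xe9" is not, since 0xE9 lies outside range(128, 160).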

UTF-8 can be recognized pretty reliably: even though it allows all bytes
to appear, it is very constrained in what sequences of bytes it allows.
E.g. you can't have a single byte >127 in UTF-8; you need at least two
of them in a row, and they need to meet more constraints.
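
A minimal way to test this in Python, assuming the data is already in
memory as a bytes object (is_valid_utf8 is just an illustrative name),
is to let the strict utf-8 codec do the checking:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lone byte >127 between ASCII bytes is rejected ...
print(is_valid_utf8(b"caf\xe9 au lait"))
# ... while the two-byte UTF-8 encoding of the same character is accepted.
print(is_valid_utf8("caf\u00e9 au lait".encode("utf-8")))
```

Note that pure ASCII also passes this test, so a True result only
tells you the data is consistent with UTF-8.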

Regards,
Martin


