Detecting Russian and Ukrainian character sets

Tim Churches tchur at optushome.com.au
Thu Sep 12 17:04:22 EDT 2002


Here are two questions for Russian and Ukrainian Python users:

1) I understand that a common problem when processing text data
collected from various sources in Russia and Ukraine is the
mixture of character sets in use: MS-DOS, Windows, Linux, Unix and
Mac machines may each use one (or more) of a number of character
sets to encode strings, and when such data are supplied in text
files, there is usually no indication of which character set was
used. Is this correct?
http://czyborra.com/charsets/cyrillic.html has a listing of
known Cyrillic character sets.

2) Are there any Python routines available for automatically
deducing which character set was used to encode a particular
text file (or a particular string)? There is a Perl module
called Lingua::RU::Charset which seems to address this problem,
at least for Russian encodings; see
http://www.freebsd.org/cgi/url.cgi?ports/russian/p5-Lingua-RU-Charset/pkg-descr
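
To make question 2 concrete, here is a rough sketch of the kind of
routine meant, written for a current Python 3. It is not taken from
Lingua::RU::Charset or any existing module. The candidate list, the
scoring weights and the names plausibility() and
guess_cyrillic_encoding() are all assumptions made for illustration.
The idea is to decode the bytes with each candidate codec and keep the
one whose result looks most like ordinary Russian or Ukrainian text:
mostly lower-case letters, capitals only at the start of words, and
few stray symbols.

```python
# Candidate 8-bit Cyrillic codecs that ship with Python.
# This shortlist is an assumption; extend or reorder it as needed.
CANDIDATES = ("koi8_r", "koi8_u", "cp1251", "cp866",
              "iso8859_5", "mac_cyrillic")

# Letters of the Russian alphabet plus the extra Ukrainian ones.
LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя" + "іїєґ"
UPPER = LOWER.upper()


def plausibility(text):
    """Crude score: higher means 'looks more like Russian/Ukrainian'."""
    score = 0
    prev = " "
    for ch in text:
        if ch in LOWER:
            score += 1                      # ordinary lower-case letter
        elif ch in UPPER:
            # A capital is fine at the start of a word, odd inside one.
            score += 1 if not prev.isalpha() else -1
        elif ord(ch) > 127:
            score -= 2                      # box drawing, foreign letters, U+FFFD
        prev = ch
    return score


def guess_cyrillic_encoding(data, candidates=CANDIDATES):
    """Return the candidate codec whose decoding of `data` scores best.

    Ties go to the candidate listed first, so pure Russian text that is
    valid in both KOI8-R and KOI8-U is reported as koi8_r.
    """
    return max(
        candidates,
        key=lambda enc: plausibility(data.decode(enc, errors="replace")),
    )


if __name__ == "__main__":
    sample = "Привет, мир"
    for enc in ("koi8_r", "cp1251", "cp866"):
        print(enc, "->", guess_cyrillic_encoding(sample.encode(enc)))
```

A heuristic like this is only a guess. It is likely to go wrong on
very short strings, on text written entirely in capitals, and on
lower-case-only text where Windows-1251 and Mac Cyrillic share the same
byte values, so the result should be treated as a hint to be checked
rather than a definite answer.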

Tim C



