Detecting Russian and Ukrainian character sets

Martin v. Loewis martin at v.loewis.de
Fri Sep 13 01:57:44 EDT 2002


Tim Churches <tchur at optushome.com.au> writes:

> Is this correct?  

I would claim that the problem is, for many applications, slightly
different. For a number of applications, in particular on the
internet, there are well-established procedures for communicating the
encoding/charset of data as meta-information. The problem then is that
applications often ignore those declarations when they should really
respect them.
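
To illustrate the "respect the declaration" case, a minimal Python
sketch could pull the charset parameter out of a Content-Type value
before even considering auto-detection (the regular expression and the
iso-8859-1 fallback here are only illustrative assumptions, not what
any particular application mandates):

    import re

    def declared_charset(content_type, default="iso-8859-1"):
        # Pull the charset parameter out of a Content-Type value such as
        # "text/html; charset=koi8-r".  If nothing is declared, fall back
        # to a default of the caller's choosing.
        m = re.search(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)', content_type or "")
        if m:
            return m.group(1).lower()
        return default

    # e.g. declared_charset("text/html; charset=KOI8-R") -> "koi8-r"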

For other applications, in particular with text files, it is true that
the text file does not carry a charset declaration. However, a good
estimate can usually be made by finding out what computer system is
being used and following that system's local conventions.
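
For the text-file case, "follow the local conventions" can be as simple
as asking the locale machinery (a sketch assuming a Python that has
locale.getpreferredencoding(); what the preferred encoding actually is
depends entirely on how the system is configured):

    import locale

    def system_text_encoding():
        # Plain text files without any declaration are usually written in
        # the encoding the local system is configured for, so use the
        # locale's preferred encoding as the estimate.
        enc = locale.getpreferredencoding()
        if enc:
            return enc
        return "ascii"  # last-resort fallback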

Both approaches may go wrong: the charset declaration in an HTML file
may be missing, and the text file may have been copied from one system
to another. But so may auto-detection of encodings.

That is not to say that auto-detection isn't useful - but I'd like to
point out that it should be only one of many options.

In addition to the Perl module, it appears that there are a number of
other autodetection modules: Mozilla supports Cyrillic auto-detection,
and the ru-xcode package (available at least as a FreeBSD port) does
autodetection as well. I'm not aware of a Python module, but it
shouldn't be difficult to port any of these algorithms to Python.
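
The general shape of such a detector is easy to sketch in Python, even
if the scoring below is deliberately crude (the candidate list and the
"count Cyrillic code points" heuristic are mine for illustration;
Mozilla and ru-xcode use proper character-frequency statistics):

    def guess_cyrillic_encoding(data):
        # Try each candidate encoding and score the result by how many
        # characters land in the Cyrillic block (U+0400..U+04FF); the
        # highest-scoring candidate is our guess.
        candidates = ["koi8-r", "cp1251", "cp866", "iso8859-5"]
        best, best_score = None, -1
        for enc in candidates:
            try:
                text = data.decode(enc)
            except (UnicodeDecodeError, LookupError):
                continue
            score = len([ch for ch in text if u"\u0400" <= ch <= u"\u04ff"])
            if score > best_score:
                best, best_score = enc, score
        return best

Since all four candidates are single-byte encodings, almost any byte
string will decode under all of them, so a real detector weights the
result with letter or bigram frequencies typical of Russian and
Ukrainian text rather than merely counting Cyrillic characters.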

Regards,
Martin
