Determining Unicode encoding.

"Martin v. Löwis" martin at v.loewis.de
Tue Apr 29 17:02:13 EDT 2003


Sean wrote:

> The problem is that I don't know how to determine what
> the *right* encoding to use on a particular string is.  

Do you have this problem when reading a byte string, or when
writing it?

If you are given a byte string, and you are supposed to interpret
the bytes as characters, there is, in general, no good way to do
so - that's why people came up with the idea of a universal character
set in the first place, to overcome the problems with multiple
character sets.

That said, you can make educated guesses on the data you read.

1. Perhaps the data you read has some file format which specifies
    the encoding, or allows parametrization, such as XML or HTML.
    You will need to look *into* the file to find out what its
    encoding is.

2. Perhaps the data has some fixed encoding, as part of the file
    format specification. For many files, this is US-ASCII.

3. Perhaps this is a plain text file, and you should use the encoding
    that the user's text editor is most likely to use (of course, you
    don't know what text editor the user uses, nor what encoding that
    editor uses). locale.getdefaultlocale()[1] offers you some guess;
    Python 2.3's locale.getpreferredencoding() gives a better guess.
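To illustrate the third guess: in current Python, locale.getpreferredencoding() (the call introduced in 2.3) is still the standard way to ask what encoding the user's environment favours. A minimal sketch:

```python
import locale

# Ask the environment which encoding the user's tools most likely use
# for plain text files (e.g. 'UTF-8' on most modern systems).
enc = locale.getpreferredencoding()

# Use that guess to interpret bytes of unknown origin; errors='replace'
# substitutes a replacement character instead of raising an exception.
raw = b"plain text of unknown encoding"
text = raw.decode(enc, errors="replace")
print(enc, text)
```

It is still only a guess, of course - nothing guarantees the file was actually written with the editor's default encoding.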

> Is there any way to
> determine, from the unicode string itself, what encoding I need to use
> to prevent data loss?

That sounds like you have the problem when *writing* Unicode strings.

In that case, you can invoke .encode: it will raise a UnicodeError if
the string contains characters that the encoding cannot represent. At
some point, you need to make up your mind what encoding to use for a
certain file - if you then get an error, all you can do is inform the
user, and

a) perhaps ignore the bad characters, replacing them with appropriate
    replacement characters (usually '?'), or

b) go back and recode the output so far in a different encoding.
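Option (a) maps directly onto the errors argument of .encode; a small sketch:

```python
text = "Gr\u00fc\u00dfe from Martin"  # contains characters outside US-ASCII

# Strict encoding raises UnicodeEncodeError for unrepresentable characters.
try:
    data = text.encode("ascii")
except UnicodeEncodeError:
    # Option (a): substitute the replacement character '?' instead.
    data = text.encode("ascii", errors="replace")

print(data)  # b'Gr??e from Martin'
```

Option (b) has no shortcut: you must buffer (or reread) the output written so far and re-encode all of it with the wider encoding.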


> Am I even asking the right questions?  I'm really pretty lost and my
> O'Reilly books aren't helping very much.

Don't worry. These things are inherently difficult. Organizations like
W3C have essentially given up, and say that XML is UTF-8 by default
(knowing that this will support arbitrary characters). If people 
absolutely want XML in different encodings, they can do that, but they
are left alone with the issue of encoding unsupported characters
(for XML, they can actually use character references).
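The XML escape hatch mentioned above is available directly as an error handler in Python's .encode:

```python
text = "sm\u00f8rrebr\u00f8d"

# Characters the target encoding cannot represent are emitted as
# XML/HTML character references instead of causing an error.
data = text.encode("ascii", errors="xmlcharrefreplace")
print(data)  # b'sm&#248;rrebr&#248;d'
```

The resulting bytes are pure ASCII, yet an XML parser will reconstruct the original characters from the references.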

You will have to make explicit choices: either support only UTF-8
(and accept that it will be tedious for some users to produce the proper
files), or support arbitrary encodings (and accept that some encodings
cannot represent all characters, and that you may not have the codecs
available to read the data, and that a mechanism must be provided to
determine the encoding), or support only a few non-UTF-8 encodings
(restricting the data format to a subset of all living languages).
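If you go the arbitrary-encodings route, you can at least check up front whether a codec is available, rather than failing midway through reading the data. A sketch, using the standard codecs module:

```python
import codecs

def codec_available(name):
    # codecs.lookup raises LookupError for encodings Python cannot handle.
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

print(codec_available("utf-8"))          # a codec every Python has
print(codec_available("no-such-codec"))  # an unknown name
```

This lets you report "unsupported encoding" to the user as a clean, early error.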

Regards,
Martin




