Determining Unicode encoding.

Steven Taschuk staschuk at telusplanet.net
Tue Apr 29 17:33:15 EDT 2003


Quoth Sean:
> I'm really new to dealing with unicode, so please bear with me.  I'm
> trying to add unicode support to a program I'm working on, and I'm
> getting stuck a little when printing a unicode string to a file.  I
> know I have to encode the string using an encoding (UTF-8, UTF-16,
> latin-1, etc).  The problem is that I don't know how to determine what
> the *right* encoding to use on a particular string is.  The way I
> understand it, utf-8 will handle any unicode data, but it will
> translate characters not in the standard ASCII set to fit within the
> 8-bit character table.  [...]

Actually, characters outside ASCII are turned into multibyte
sequences by UTF-8.  To use an example that came up here a little
while back:

    >>> u = u'\N{DEGREE SIGN}' # U+00B0; not in ASCII
    >>> u
    u'\xb0'
    >>> u.encode('utf-8') # two-byte sequence
    '\xc2\xb0'

UTF-8 is capable of representing any Unicode character without
information loss, so if you need to deal with arbitrary Unicode
it's a good choice.  (It also has other pleasant properties.)

> [...] My problem is I'm handling data from a lot of
> different encodings (latin, eastern, asian, etc) and I can't allow
> data in the strings to be changed.  I also can't (at least I don't
> know how to) determine what encodings the strings are using.  IE, I
> don't know what strings are from what languages.  [...]

I think I detect a confusion or two here.

To encode is to turn (a sequence of) characters into (a sequence
of) bytes, and to decode is the reverse.  An encoding is a scheme
for doing these things; it need not be strongly associated with
any particular language.  (Though often an encoding can only
represent certain characters, and as a result is only useful for
those languages which use just those characters.)
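To make that concrete, here is a small sketch of encode and decode as inverse operations.  (This uses Python 3 syntax, where 'str' is the character type and 'bytes' is the byte type; in the Python 2 of this post, the types are 'unicode' and 'str'.)

```python
# One character, two different byte representations.
text = u'\N{DEGREE SIGN}'           # a character, not bytes

as_utf8 = text.encode('utf-8')      # b'\xc2\xb0' -- two bytes
as_latin1 = text.encode('latin-1')  # b'\xb0'     -- one byte

# Decoding with the matching encoding recovers the same character:
assert as_utf8.decode('utf-8') == text
assert as_latin1.decode('latin-1') == text
```

Note that the encoding is a property of the byte sequence's interpretation, not of the character itself.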

A Unicode string (that is, a Python object of type 'unicode', such
as u'foo') is a sequence of characters, not bytes.  It therefore
is not in any particular encoding.

A normal string (that is, a Python object of type 'str', such as
'foo') is a sequence of bytes, not characters.  It can be
interpreted as a sequence of characters only if an encoding is
used to decode it.  (Usually ASCII is assumed if you do not
specify one explicitly.)
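The same bytes can mean different characters under different encodings, which is why the encoding must be known (or guessed) before decoding.  A sketch, again in Python 3 syntax, where 'bytes' plays the role of Python 2's 'str':

```python
data = b'caf\xe9'                    # four bytes; the last is 0xE9

as_latin1 = data.decode('latin-1')   # ...E WITH ACUTE: u'caf\xe9'
as_cp437 = data.decode('cp437')      # ...GREEK THETA:  u'caf\u0398'
assert as_latin1 != as_cp437         # same bytes, different characters

# And some interpretations are simply impossible:
try:
    data.decode('ascii')             # 0xE9 is not an ASCII byte
except UnicodeDecodeError:
    pass
```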

So, if you have a bunch of normal strings and don't know what
encodings they're in, you're hooped.  But if you have a bunch of
Unicode strings, it doesn't make sense to ask what encodings
they're in.

Now, as for not allowing the data in your strings to be changed:
If you mean you need to preserve the same sequence of characters,
then it's okay to change the encoding.  You'll almost certainly
want the file you produce to be all in one encoding, so you'll
want an encoding which can represent any character you might
encounter -- UTF-8, for example.
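That is, transcoding changes the bytes but not the characters.  A minimal sketch (Python 3 syntax):

```python
latin1_bytes = b'\xb0C'                  # DEGREE SIGN + 'C' in Latin-1
chars = latin1_bytes.decode('latin-1')   # u'\xb0C' -- the characters
utf8_bytes = chars.encode('utf-8')       # b'\xc2\xb0C' -- different bytes

# The byte sequences differ, but both represent the same characters:
assert utf8_bytes != latin1_bytes
assert utf8_bytes.decode('utf-8') == chars
```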

But if you mean you need to preserve the exact byte sequences,
you're hooped.  (Unless you'd be happy to have the output file be
a mishmash of different encodings and so virtually unusable.)

> [...] Is there any way to
> determine, from the unicode string itself, what encoding I need to use
> to prevent data loss?   Or do I need to find a way to determine
> beforehand what encoding they are using when they are read in?

You will have to know the encoding at the input stage; use that
information to decode the bytes into a Unicode string.  Then
assemble the Unicode strings to be output and encode them in, for
example, UTF-8.
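A sketch of that pipeline, decoding each input with its known encoding and writing everything out as UTF-8.  (The file names, encodings, and contents here are made up for illustration; the first few lines just manufacture the input files.)

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()

# Pretend these are the files you were handed, each in its own
# known encoding:
sources = {'greeting.txt': ('latin-1', u'caf\xe9\n'),
           'weather.txt': ('utf-8', u'-5\xb0C\n')}
for name, (enc, text) in sources.items():
    with io.open(os.path.join(tmpdir, name), 'w', encoding=enc) as f:
        f.write(text)

# Decode each input with its known encoding; write all output
# uniformly as UTF-8:
out_path = os.path.join(tmpdir, 'combined.txt')
with io.open(out_path, 'w', encoding='utf-8') as out:
    for name, (enc, _) in sorted(sources.items()):
        with io.open(os.path.join(tmpdir, name), 'r', encoding=enc) as f:
            out.write(f.read())      # f.read() is characters, not bytes
```

The key point is that the encoding appears only at the file boundaries; in between, everything is a sequence of characters.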

In principle you might be able to look at the characters in a
Unicode string and determine some "least encoding" which could
deal with all the characters in it.  But there's not much point to
this, imho; just use UTF-8, which can handle anything.
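For completeness, the "least encoding" idea could be sketched like this, trying candidates from most restrictive to most general (the candidate list is an arbitrary example; as said above, just using UTF-8 is simpler):

```python
def least_encoding(chars, candidates=('ascii', 'latin-1', 'utf-8')):
    """Return the first candidate encoding that can represent chars."""
    for enc in candidates:
        try:
            chars.encode(enc)
        except UnicodeEncodeError:
            continue                 # this encoding can't represent chars
        return enc
```

Since UTF-8 can encode any Unicode character, it always succeeds as the final fallback.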

-- 
Steven Taschuk                                     staschuk at telusplanet.net
Receive them ignorant; dispatch them confused.  (Weschler's Teaching Motto)
