Python and UTF-8

Thu Jan 3 12:58:58 EST 2002

Matthias Huening <mhuening at zedat.fu-berlin.de> writes:

> > You have to know the encoding the data is currently in, say
> > current_encoding. Then, to convert it to UTF-8, you write
> > 
> > data = unicode(data, current_encoding).encode('utf-8')
> > 
> 
> Yes, but what if I don't know? 

If you get byte data from some source and want to interpret those
bytes as character strings, you *have* to know the encoding. If you
don't, consider re-architecting your application so that you do.
If you still cannot know, guess. If you guess wrong often enough, your
users will complain, and will then be willing to accept the additional
infrastructure needed to properly identify the encoding of byte data.
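
As a rough illustration of "guess", here is a minimal sketch that tries a
list of candidate encodings in order and re-encodes the result as UTF-8.
It is written for Python 3, where bytes.decode plays the role of
unicode(data, encoding). The function name to_utf8 and the candidate list
are assumptions made for this example, not anything from the original
exchange. A decode that succeeds does not prove the guess was right; it
only shows the bytes are valid in that encoding.

```python
def to_utf8(data, candidates=("utf-8", "cp1252")):
    """Decode bytes with the first candidate that fits, then re-encode as UTF-8.

    Returns (utf8_bytes, encoding_used). The candidate list is only a guess
    (hypothetical defaults); put the most restrictive encodings first.
    """
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
        return text.encode("utf-8"), encoding
    raise ValueError("none of the candidate encodings fit the data")
```

Strict encodings such as UTF-8 belong at the front of the list, because a
permissive single-byte encoding would accept almost any input and hide
the better guess.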

> How does Python handle Unicode-files?

There is no such thing as a Unicode file. Files are byte-oriented on
all systems I know. So when opening a file, you need to specify the
encoding. You can use codecs.open to read from a file and get Unicode
strings out of it.
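
A small sketch of reading with codecs.open, assuming Python 3 and a file
known to be in ISO-8859-1. The helper names read_text and demo, and the
sample bytes, are made up for this example.

```python
import codecs
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file and return its contents as a Unicode string."""
    with codecs.open(path, "r", encoding=encoding) as f:
        return f.read()


def demo():
    """Write raw ISO-8859-1 bytes to a temporary file, then read them as text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"Stra\xdfe\n")  # "Straße" encoded as ISO-8859-1
        return read_text(path, "iso-8859-1")
    finally:
        os.remove(path)
```

The file itself only ever holds bytes; the encoding argument is what
turns them into characters.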

> How does sorting work with Unicode?

By default, it sorts by the numeric value of the Unicode code points.
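
To show what that means in practice, a short sketch assuming Python 3,
where every str is a Unicode string. The sample words are invented for
the example.

```python
# "\u00c4pfel" is "Äpfel" with a precomposed A-umlaut (code point 196).
words = ["zebra", "\u00c4pfel", "Zebra", "Apfel"]

# Plain sorted() compares code points: 'A' (65) < 'Z' (90) < 'z' (122) < 'Ä' (196).
by_code_point = sorted(words)
# by_code_point is ["Apfel", "Zebra", "zebra", "Äpfel"]
```

Upper case sorts before lower case, and the umlaut lands after every
ASCII letter, which is not what a German dictionary would do.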

> Can I use locales with Unicode (e.g. to sort words according to the
> German convention?) How?

You sort plain (byte) strings according to locale with
locale.strcoll. In theory, this function ought to work for Unicode
strings, too; it is a bug that it currently doesn't.

To work around this, you need to encode the Unicode strings into the
locale's character set, and compare the resulting byte strings with
strcoll.
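
The encode-then-compare workaround applies to the Python of this message.
As a sketch for Python 3 instead, where locale.strcoll accepts text
strings directly and no encoding step is needed, locale-aware sorting
might look like the following. The function name sort_with_locale and
the locale name "de_DE.UTF-8" are assumptions; that locale may not be
installed on a given system, in which case the sketch falls back to
whatever collation is already active.

```python
import functools
import locale


def sort_with_locale(words, locale_name="de_DE.UTF-8"):
    """Sort words by the collation rules of locale_name, if it is available."""
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            # Locale not installed: keep the collation that is already active.
            pass
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting
```

The resulting order depends on the platform's locale data, so no
particular output is claimed here. Note that setlocale changes
process-wide state, which is why the previous setting is restored.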

> How to use regular expressions with Unicode?

Just use the re module: it fully supports Unicode.
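
For instance, a minimal sketch assuming Python 3, with a sample sentence
invented for the example. The re.UNICODE flag is passed explicitly; in
Python 3 it is already the default for text patterns, while older
versions needed it for \w to match non-ASCII letters.

```python
import re

text = "Gr\u00fc\u00dfe aus K\u00f6ln"  # "Grüße aus Köln"

# \w matches letters beyond ASCII when the pattern is Unicode-aware.
words = re.findall(r"\w+", text, flags=re.UNICODE)
# words is ["Grüße", "aus", "Köln"]
```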

Regards,
Martin