[Python-Dev] Python in Unicode context

"Martin v. Löwis" martin at v.loewis.de
Thu Aug 5 09:21:37 CEST 2004


François Pinard wrote:
> However, and I shall have the honesty to
> state it, this is *not* respectful of the general Unicode spirit: the
> Python implementation allows for independently addressable surrogate
> halves

This is only a problem if you have data which require surrogates (which
I claim are rather uncommon at the moment), and you don't have a UCS-4
build of Python (in which surrogates don't exist). As more users demand
convenient support for non-BMP characters, you'll find that more builds
of Python become UCS-4. In fact, you might find that the build you are
using already has sys.maxunicode > 65535.
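
For illustration, here is a quick way to see which kind of build you
have, and how surrogate halves surface on a narrow build (a minimal
Python 2 sketch; the test character is just the first code point past
the BMP):

    import sys

    c = u"\U00010000"            # first code point outside the BMP
    if sys.maxunicode > 65535:
        assert len(c) == 1       # wide (UCS-4) build: one real character
    else:
        assert len(c) == 2       # narrow build: a surrogate pair
        assert 0xD800 <= ord(c[0]) <= 0xDBFF   # addressable high half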

> combining zero-width diacritics

Indeed. However, it is not clear to me how this problem could be
addressed, and I'm not aware of any API (in any language) that
addresses it.
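
For concreteness, a combining sequence really is several independently
addressable code points (a small sketch using the standard unicodedata
module):

    import unicodedata

    s = u"e\u0301"      # 'e' + COMBINING ACUTE ACCENT, renders as 'é'
    assert len(s) == 2  # two code points, though one visible character
    assert unicodedata.combining(s[1]) != 0   # the accent combines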

Typically, people need things like this (a rough sketch of the first
item follows the list):
- in a fixed-width terminal, which characters occupy which columns.
   Notice that this involves East-Asian wide characters, where a single
   Unicode character (a "wide" character) occupies two columns. OTOH,
   with combining characters, a sequence of characters might be
   associated with a single column. Furthermore, some code points might
   not be associated with a column at all.
- for a given font, how many points a string occupies, horizontally
   and vertically.
- where the next word break is.
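
Something like the following could be layered on top of unicodedata
for the first item (east_asian_width is new in Python 2.4; real
terminals have more special cases than this sketch handles):

    import unicodedata

    def columns(s):
        # Approximate column count of a Unicode string in a
        # fixed-width terminal.
        n = 0
        for ch in s:
            if unicodedata.combining(ch):
                continue                 # combining marks: no column
            if unicodedata.east_asian_width(ch) in ('W', 'F'):
                n += 2                   # wide/fullwidth: two columns
            else:
                n += 1
        return n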

I don't know what your application is, but I somewhat doubt it is as
simple as "give me a thing describing the nth character, including
combining diacritics".

However, it is certainly possible to implement libraries on top of the
existing code, and if there is a real need for that, somebody will
contribute it.

> normal _and_ decomposed forms,

Terminology alert: there are multiple normal forms in Unicode, and
some of them are decomposed (e.g. NFD, NFKD).

I fail to see a problem with that. There are applications for all
normal forms, and many applications don't need the overhead of
normalization. It might be that the code for your languages becomes
simpler when always assuming NFC, but this hardly holds for all
languages, or all applications.
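
For what it's worth, unicodedata.normalize (available since Python
2.3) already lets you move between these forms; notice that the same
text has different lengths in different normal forms:

    import unicodedata

    nfc = u"\u00e9"                          # 'é', one precomposed point
    nfd = unicodedata.normalize('NFD', nfc)  # 'e' + combining acute
    assert len(nfc) == 1 and len(nfd) == 2   # same text, two lengths
    assert unicodedata.normalize('NFC', nfd) == nfc   # and it round-trips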

> directional marks, linguistic marks and various other such complexities.

Same comment as above: if this becomes a real problem, people will
contribute code to deal with it.

> But in our case, where applications already work in Latin-1, abusing our
> Unicode luck, UTF-8 may _not_ be used as is, we ought to use Unicode or
> wide strings as well, for preserving S[N] addressability.  So changing
> source encodings may be intimately tied to going Unicode whenever UTF-8
> (or any other variable-length encoding) gets into the picture.

Yes. There is not much Python can do about this. UTF-8 is very nice for
transfer of character data, but it does have most of the problems of
a multi-byte encoding. I still prefer it over UTF-16 or UTF-32 for
transfer, though.
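
A short Python 2 illustration of the S[N] point: indexing the UTF-8
bytes lands inside multi-byte sequences, while the decoded Unicode
string remains addressable per character:

    s = "Fran\xc3\xa7ois"        # UTF-8 bytes: the 'ç' takes two bytes
    u = s.decode('utf-8')
    assert len(s) == 9           # nine bytes ...
    assert len(u) == 8           # ... but eight characters
    assert u[4] == u"\u00e7"     # per-character indexing on the Unicode object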

> I hope that my explanation above helps at seeing that source encoding
> and choice of string literals are not as independent as one may think.

It really depends on your processing needs. But yes, my advice still
stands: convert to Unicode objects as early as possible in the
processing. For source code involving non-ASCII characters, this means
you really should use Unicode literals.
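
In practice, "as early as possible" means decoding right at the I/O
boundary, e.g. with codecs.open, so that only Unicode objects
circulate inside the program (a sketch; the file name and the encoding
are placeholders):

    # -*- coding: utf-8 -*-
    import codecs

    text = codecs.open('input.txt', 'r', 'utf-8').read()  # decode on input
    banner = u"Résumé: "     # Unicode literal, safe in non-ASCII source
    print((banner + text).encode('utf-8'))                # encode on output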

Of course, my other advice also applies: if you have a program that
deals with multiple languages, use only ASCII in the source, and use
gettext for the messages.
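
A minimal sketch of that pattern ('myapp' and the locale directory are
placeholders; ugettext returns the catalog's messages as Unicode
objects in Python 2):

    import gettext

    t = gettext.translation('myapp', '/usr/share/locale', fallback=True)
    _ = t.ugettext                   # look up messages as Unicode strings
    print(_("Quit").encode('utf-8')) # ASCII msgid, localized at runtime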

> There ought to be a way to maintain a single Python source that would
> work dependably through re-encoding of the source, but not uselessly
> relying on wide strings when there is no need for them.  That is,
> without marking all literal strings as being Unicode.  Changing encoding
> from ISO 8859-1 to UTF-8 should not be a one-way, no-return ticket.

But it is not: yes, as you say, you have to add u prefixes when going
to UTF-8. But then you can go back to Latin-1 with *no* change other
than recoding the file and changing the encoding declaration. The
string literals can all stay Unicode literals - the conversion to
Latin-1 then really has *no* effect on the runtime semantics.
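
Concretely, this is all the compiler sees of the u"" literal in each
version of the file (a Python 2 sketch with the bytes spelled out):

    utf8_bytes   = "Fran\xc3\xa7ois"  # u"François" in a utf-8 file
    latin1_bytes = "Fran\xe7ois"      # the same line, recoded to latin-1

    # Different bytes on disk, identical Unicode object at runtime:
    assert utf8_bytes.decode('utf-8') == latin1_bytes.decode('latin-1')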

> Of course, it is very normal that sources may have to be adapted for the
> possibility of a Unicode context.  There should be some good style and
> habits for writing re-encodable programs.  So this exchange of thoughts.

If that is the goal, you really need Unicode literals - everything else
*will* break under re-encoding.
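
To see the breakage with plain (byte) string literals, which in
Python 2 keep the raw bytes of the source file (again a sketch with
the bytes written out explicitly):

    assert len("\xe9") == 1        # s = "é" in a Latin-1 source file
    assert len("\xc3\xa9") == 2    # the same line, recoded to UTF-8
    # A Unicode literal u"é" keeps len() == 1 under either encoding.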

Regards,
Martin

