How to waste computer memory?

Marko Rauhamaa marko at pacujo.net
Sat Mar 19 08:42:45 EDT 2016


Steven D'Aprano <steve at pearwood.info>:

> As usual, Unicode problems are generally due to backwards
> compatibility. Blame the old legacy encodings, which invented the
> "dead keys" a.k.a. "combining character" technique. Of course, they
> had a reasonable excuse at the time, but Unicode's requirement of
> being able to losslessly handle all legacy character set standards
> means that Unicode has to provide the same functionality.

The combining characters allow for a maze of twisty little combinations,
all alike. There's no limit to the number of diacritics you can pile on,
under and next to the base character.

Was that universality unavoidable? Maybe it was. Deep down, all scripts
are two-dimensional.

> The problem is not so much the existence of combining characters, but that
> *some* but not all accented characters are available in two forms: a
> composed single code point, and a decomposed pair of code points.

Also, is an a with ring on top and another ring on bottom the same
character as an a with ring on bottom and another ring on top?
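For this particular case, canonical normalization does give an answer: marks
with different combining classes get sorted into a fixed order. A minimal
sketch, assuming U+030A (COMBINING RING ABOVE) and U+0325 (COMBINING RING
BELOW) are the two rings meant here; the variable names are only
illustrative:

```python
import unicodedata

RING_ABOVE = "\u030a"  # COMBINING RING ABOVE (assumed to be the "ring on top")
RING_BELOW = "\u0325"  # COMBINING RING BELOW (assumed to be the "ring on bottom")

top_first = "a" + RING_ABOVE + RING_BELOW
bottom_first = "a" + RING_BELOW + RING_ABOVE

# The raw code point sequences are different strings...
raw_equal = top_first == bottom_first

# ...but normalization reorders marks by canonical combining class,
# so both spellings normalize to the same sequence.
nfd_equal = (unicodedata.normalize("NFD", top_first)
             == unicodedata.normalize("NFD", bottom_first))

print(raw_equal, nfd_equal)
print(unicodedata.combining(RING_ABOVE), unicodedata.combining(RING_BELOW))
```

Two marks that share a combining class (say, two different marks that both
sit above the base) keep their relative order, so there the order still
distinguishes the strings.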

> This adds complexity and means that equality of characters is not
> well-defined. (Hence Unicode punts on the whole "character" thing and
> just talks about code points.)

The problem is not theoretical. If I implement a web form and someone
enters "Aña" as their name, how do I make sure queries find the name
regardless of the Unicode code point sequence? I have to normalize using
unicodedata.normalize().
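A rough sketch of what that looks like, assuming NFC is the chosen form and
that the same function is applied both when storing and when querying. The
helper name lookup_key and the dict standing in for the database are
hypothetical:

```python
import unicodedata

def lookup_key(name):
    """Return a normalized form of name for storing and comparing."""
    return unicodedata.normalize("NFC", name)

composed = "A\u00f1a"     # ñ as a single code point
decomposed = "An\u0303a"  # n followed by COMBINING TILDE

# Store under the normalized key (a dict stands in for the real database).
names = {lookup_key(composed): "user record"}

# A query typed with the other code point sequence still finds the entry.
found = names.get(lookup_key(decomposed))
print(found)
```

Whether NFC, NFD or one of the compatibility forms is the right choice
depends on the application; the point is only that both sides of the
comparison go through the same normalization.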

When glorifying Python's advanced Unicode capabilities, are we careful
to emphasize the necessity of unicodedata.normalize() everywhere? Should
Python normalize strings unconditionally and transparently? What does
the O(1) character lookup mean under normalization?
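To make the last question concrete, here is a small sketch showing that
indexing is by code point, so what you find at a given index depends on the
normalization form. The name "Aña" is the one from above; nothing else is
assumed:

```python
import unicodedata

composed = unicodedata.normalize("NFC", "An\u0303a")
decomposed = unicodedata.normalize("NFD", "A\u00f1a")

# Same name on screen, different number of code points.
print(len(composed), len(decomposed))

# The same index lands on different things.
print(repr(composed[2]), repr(decomposed[2]))
```

The two strings have lengths 3 and 4, and index 2 is the final "a" in one
and the combining tilde in the other.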

Some weeks ago I had to spend 30 minutes to debug my Python program when
a user complained it didn't work. Turns out they had accidentally
invoked the program using a space and a combining tilde instead of the
ASCII ~. There was no visual indication of a problem on the screen, but
the Python program acted up.
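As far as I can tell, normalization would not have helped here, since the
ASCII ~ has no decomposition and a space plus a combining tilde never turns
into it. The best I can think of is to look for stray combining marks in the
input. A sketch, where the helper name combining_marks and the example
argument are made up for illustration:

```python
import unicodedata

def combining_marks(text):
    """Return (index, name) for every combining mark found in text."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
    ]

lookalike = " \u0303/notes.txt"  # space + COMBINING TILDE (hypothetical argument)
genuine = "~/notes.txt"          # plain ASCII tilde

print(combining_marks(lookalike))
print(combining_marks(genuine))
```

A program could run this over sys.argv and warn when the list is non-empty,
although that would also flag perfectly legitimate decomposed text.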


Marko


