How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sat Mar 19 10:39:47 EDT 2016


On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:

> The problem is not theoretical. If I implement a web form and someone
> enters "Aña" as their name, how do I make sure queries find the name
> regardless of the unicode code point sequence? I have to normalize using
> unicodedata.normalize().

I didn't say that it was theoretical. It is a real problem, but it is a
problem with human languages: the number of characters-with-accents is
vast, possibly impossibly vast. They can't all have unique code points.
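To make the "Aña" case concrete, here is a sketch: the ñ can arrive either as the precomposed code point U+00F1 or as n followed by U+0303 COMBINING TILDE, and the two sequences only compare equal after normalising both sides (NFC chosen here; any single consistent form works):

```python
import unicodedata

composed = "A\u00F1a"     # "Aña" using precomposed U+00F1
decomposed = "An\u0303a"  # "Aña" using n + U+0303 COMBINING TILDE

# The raw strings are different code point sequences:
print(composed == decomposed)  # -> False

# Normalising both to the same form makes them compare equal:
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # -> True
```

In practice that means normalising once on the way into the database *and* normalising each query string, so every comparison happens in the same form.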

I must admit I had completely missed your example of multiple combining
characters, that's a good one. Here's the example again:

a + combining ring above + combining ring below, versus
a + combining ring below + combining ring above

Naturally, comparing them directly shows they are unequal:

py> s = "a\u030A\u0325"
py> t = "a\u0325\u030A"
py> s == t
False


But we can normalise them:

====  =============  =============  ==================  =================
Form  NFC            NFKC           NFD                 NFKD
====  =============  =============  ==================  =================
s     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
t     U+1E01,030A    U+1E01,030A    U+0061,0325,030A    U+0061,0325,030A
====  =============  =============  ==================  =================


As you can see, *any* of the normalisation forms will put the code points
into the same canonical order, making the strings compare equal.
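The table can be verified in a couple of lines (a sketch, using the standard library's unicodedata module):

```python
import unicodedata

s = "a\u030A\u0325"  # a + COMBINING RING ABOVE + COMBINING RING BELOW
t = "a\u0325\u030A"  # a + COMBINING RING BELOW + COMBINING RING ABOVE

# Canonical reordering happens in all four forms, so every one of
# them maps s and t to the same sequence:
for form in ("NFC", "NFKC", "NFD", "NFKD"):
    print(form, unicodedata.normalize(form, s) ==
                unicodedata.normalize(form, t))  # -> True for each form
```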


> When glorifying Python's advanced Unicode capabilities, are we careful
> to emphasize the necessity of unicodedata.normalize() everywhere? Should
> Python normalize strings unconditionally and transparently? What does
> the O(1) character lookup mean under normalization?
> 
> Some weeks ago I had to spend 30 minutes to debug my Python program when
> a user complained it didn't work. Turns out they had accidentally
> invoked the program using a space and a composing tilde instead of the
> ASCII ~. There was no visual indication of a problem on the screen, but
> the Python program acted up.

We recently had somebody here who wrote capital I by pressing the lower-case
l on the keyboard. Should a pure-ASCII program be able to operate without
malfunction if the user confuses 0 and O, or I, l and 1? What about ' and `,
or even '' and "?
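Marko's combining-tilde case above is worth reproducing, because it shows that normalisation is no cure for visual lookalikes (a sketch):

```python
import unicodedata

# A space followed by U+0303 COMBINING TILDE renders much like an
# ASCII tilde, but it is a different, two-character string:
lookalike = " \u0303"
print(lookalike == "~")   # -> False
print(len(lookalike))     # -> 2

# Normalisation does not help here: compatibility decompositions are
# never recomposed, so the strings remain genuinely different.
print(unicodedata.normalize("NFC", lookalike) == "~")  # -> False
```

Like 0 versus O, this is a case where two inputs merely *look* the same; no amount of normalising will (or should) conflate them.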


-- 
Steven
