How to waste computer memory?

Chris Angelico rosuav at gmail.com
Sat Mar 19 11:20:45 EDT 2016


On Sun, Mar 20, 2016 at 1:56 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Steven D'Aprano <steve at pearwood.info>:
>
>> On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:
>>> When glorifying Python's advanced Unicode capabilities, are we
>>> careful to emphasize the necessity of unicodedata.normalize()
>>> everywhere? Should Python normalize strings unconditionally and
>>> transparently? What does the O(1) character lookup mean under
>>> normalization?
>>>
>>> Some weeks ago I had to spend 30 minutes to debug my Python program
>>> when a user complained it didn't work. Turns out they had
>>> accidentally invoked the program using a space and a composing tilde
>>> instead of the ASCII ~. There was no visual indication of a problem
>>> on the screen, but the Python program acted up.
>>
>> We recently had somebody here who wrote capital I by pressing the
>> lower case l on the keyboard. Should a pure-ASCII program be able to
>> operate without malfunction if the user confuses 0 and O, or I l and
>> 1? What about ' and ` or possibly even '' and "?
>
> What I'm talking about is that maybe Python should treat canonically
> equivalent strings equivalently, that is, indistinguishably under any
> external inspection.
>
> Anyway, Python's Unicode support is a great thing, but Unicode is a big
> can of worms. Far from being a paradise, it's more of a case of picking
> your poison.

I don't believe they should be *automatically* equivalent. A Unicode
string is not a 2D collection of pixels, so it shouldn't be compared
for equality visually; nor should it automatically do other
transformations. The exact form of equivalence you want is the
application's choice, and there it should remain. You would be
absolutely *horrified* if Python started stripping leading/trailing
spaces from strings before comparing them, yet I have no doubt that
you've written programs that did exactly this. (And PHP does indeed do
transformations like this, unless you use the === operator. Out of
luck if you want to use <= or >= to order strings.) Some applications
will benefit from NFC normalization; others from NFKC. Keep it in the
application's hands, keep the language simple, and give the power to
the programmer.
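
To make the "application's choice" point concrete, here is a minimal
sketch using the standard unicodedata module. The helper names
nfc_equal and nfkc_equal are made up for illustration; they are not
part of any library. It shows that plain == compares code points, that
NFC treats composed and decomposed forms as equal, and that only NFKC
also folds compatibility characters such as the "fi" ligature.

```python
import unicodedata

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

def nfc_equal(a, b):
    """Hypothetical helper: compare under canonical equivalence (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def nfkc_equal(a, b):
    """Hypothetical helper: compare under compatibility equivalence (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# Default comparison is code point by code point; nothing is normalized.
print(composed == decomposed)            # False

# Canonically equivalent strings compare equal once both are in NFC.
print(nfc_equal(composed, decomposed))   # True

# The ligature is only a compatibility equivalent of "fi",
# so NFC keeps them distinct while NFKC folds them together.
print(nfc_equal(ligature, "fi"))         # False
print(nfkc_equal(ligature, "fi"))        # True
```

Which of the two helpers is "right" depends entirely on what the
application is doing with the text, which is the argument for leaving
the decision there.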

Note, by the way, that the language itself does some normalization on
identifiers:

>>> exec("a\u0301 = 1234; print(\u00e1)")
1234

But programmer-controlled strings are, well, programmer-controlled.
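
The difference can be checked directly. The sketch below relies on
PEP 3131, which specifies that identifiers are converted to NFKC while
parsing; the scratch dict named namespace is just an illustrative
choice. An identifier written in decomposed form ends up stored under
its composed name, while a string literal with the same characters is
left exactly as written.

```python
import unicodedata

# Run an assignment whose identifier is spelled in decomposed form
# ("a" + COMBINING ACUTE ACCENT) inside a scratch namespace.
namespace = {}
exec("a\u0301 = 1234", namespace)

# exec() also adds "__builtins__" to the dict, so filter that out.
names = [name for name in namespace if name != "__builtins__"]
stored = names[0]

print(stored == "\u00e1")     # True: the name was stored in composed form
print(stored == "a\u0301")    # False: the decomposed spelling is not the key
print(stored == unicodedata.normalize("NFKC", "a\u0301"))  # True

# A string literal with the same two characters is not touched.
literal = eval("'a\u0301'")
print(len(literal))           # 2: still "a" plus the combining accent
```

One practical consequence is that looking a name up by string, for
example via globals() or getattr(), needs the normalized spelling.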

ChrisA


