How to waste computer memory?

Marko Rauhamaa marko at pacujo.net
Sat Mar 19 07:07:59 EDT 2016


Chris Angelico <rosuav at gmail.com>:

> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Unicode made several (understandable but grave) mistakes along the way:
>>
>>    * normalization
>
> Elaborate please? What's such a big mistake here?

Unicode shouldn't have allowed multiple equivalent code point sequences
for the same string.

Now Python falls victim to:

   >>> '\u006e\u0303' == '\u00f1'
   False

<URL: https://en.wikipedia.org/wiki/Unicode_equivalence>:

   For example, the code point U+006E (the Latin lowercase "n") followed
   by U+0303 (the combining tilde "◌̃") is defined by Unicode to be
   canonically equivalent to the single code point U+00F1 (the lowercase
   letter "ñ" of the Spanish alphabet). Therefore, those sequences
   should be displayed in the same manner, should be treated in the same
   way by applications such as alphabetizing names or searching, and may
   be substituted for each other.
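
The usual workaround is to normalize both strings to the same form
before comparing them. A minimal sketch, assuming NFC is an acceptable
common form for the application; canonically_equal is a hypothetical
helper name used here for illustration, not a standard-library
function:

```python
import unicodedata

decomposed = '\u006e\u0303'   # 'n' followed by U+0303 COMBINING TILDE
precomposed = '\u00f1'        # single code point U+00F1

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper: plain == compares code points one by one,
    so canonically equivalent strings can still compare unequal.
    """
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

# Plain comparison sees two different code point sequences.
raw_equal = decomposed == precomposed

# Comparison after normalization treats them as the same string.
nfc_equal = canonically_equal(decomposed, precomposed)
```

Nothing does this automatically: every comparison, dict key and set
member has to be normalized by hand, at the cost of an extra pass over
each string.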


Marko


