How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sat Mar 19 07:32:09 EDT 2016


On Sat, 19 Mar 2016 09:18 pm, Chris Angelico wrote:

> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Unicode made several (understandable but grave) mistakes along the way:
>>
>>    * normalization
>>
> 
> Elaborate please? What's such a big mistake here?

As usual, Unicode's problems are largely due to backwards compatibility.
Blame the old legacy encodings, which introduced the combining-character
technique (the encoding-level analogue of a typewriter's "dead keys").
They had a reasonable excuse at the time, but Unicode's requirement to
round-trip all legacy character set standards losslessly means that
Unicode has to provide the same functionality.

The problem is not so much the existence of combining characters, but that
*some* but not all accented characters are available in two forms: a
precomposed single code point, and a decomposed sequence of code points.
This adds complexity and means that naive code-point comparison no longer
matches what users perceive as character equality. (Hence Unicode punts on
the whole "character" question and just talks about code points.)
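
A quick sketch of the problem in Python, using the stdlib `unicodedata`
module: "é" can be spelled as one precomposed code point or as "e" plus
a combining accent, and the two spellings compare unequal until you
normalize them to a common form.

```python
import unicodedata

# Two spellings of "é": precomposed vs. decomposed.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT

# They render identically but differ code-point-wise.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Normalizing both to the same form (here NFC) restores equality.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                  # True
```

This is why robust string comparison generally means normalizing both
operands first, rather than comparing raw code point sequences.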



-- 
Steven



