Unicode normalisation [was Re: [beginner] What's wrong?]

Peter Pearson pkpearson at nowhere.invalid
Thu Apr 7 12:51:54 EDT 2016


On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote:
> On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:
>> Rustom Mody wrote:
>
>>> So here are some examples to illustrate what I am saying:
>>> 
>>> Example 1 -- Ligatures:
>>> 
>>> Python3 gets it right
>>>>>> flag = 1
>>>>>> flag
>>> 1
[snip]
>> 
>> I do not think this is correct, though.  Different Unicode code sequences,
>> after normalization, should result in different symbols.
>
> I think you are confused about normalisation. By definition, normalising
> different Unicode code sequences may result in the same symbols, since that
> is what normalisation means.
>
> Consider two distinct strings which nevertheless look identical:
>
> py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
> py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
> py> a == b
> False
> py> print(a, b)
> ü ü
>
>
> The purpose of normalisation is to turn one into the other:
>
> py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
> True
> py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
> True

It's all great fun until someone loses an eye.

Seriously, it's cute how neatly normalisation works when you're
watching closely and using it in the circumstances for which it was
intended, but that hardly proves that these practices won't cause much
trouble when they're used more casually and nobody's watching closely.
Considering how much energy good software engineers spend eschewing
unnecessary complexity, do we really want to embrace the prospect of
having different things look identical?  (A relevant reference point:
mixtures of spaces and tabs in Python indentation.)

[snip]
> The Unicode consortium seems to disagree with you.

<cranky_geezer_font>

The Unicode consortium was certifiably insane when it went into the
typesetting business.  The pile-of-poo character was just frosting on
the cake.

</cranky_geezer_font>

(Sorry to leave you with that image.)

-- 
To email me, substitute nowhere->runbox, invalid->com.



More information about the Python-list mailing list