Unicode normalisation [was Re: [beginner] What's wrong?]

Rustom Mody rustompmody at gmail.com
Fri Apr 8 00:43:01 EDT 2016


On Thursday, April 7, 2016 at 10:22:18 PM UTC+5:30, Peter Pearson wrote:
> On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote:
> > On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:
> >> Rustom Mody wrote:
> >
> >>> So here are some examples to illustrate what I am saying:
> >>> 
> >>> Example 1 -- Ligatures:
> >>> 
> >>> Python3 gets it right
> >>>>>> flag = 1
> >>>>>> flag
> >>> 1
> [snip]
> >> 
> >> I do not think this is correct, though.  Different Unicode code sequences,
> >> after normalization, should result in different symbols.
> >
> > I think you are confused about normalisation. By definition, normalising
> > different Unicode code sequences may result in the same symbols, since that
> > is what normalisation means.
> >
> > Consider two distinct strings which nevertheless look identical:
> >
> > py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
> > py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
> > py> a == b
> > False
> > py> print(a, b)
> > ü ü
> >
> >
> > The purpose of normalisation is to turn one into the other:
> >
> > py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
> > True
> > py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
> > True
> 
> It's all great fun until someone loses an eye.
> 
> Seriously, it's cute how neatly normalisation works when you're
> watching closely and using it in the circumstances for which it was
> intended, but that hardly proves that these practices won't cause much
> trouble when they're used more casually and nobody's watching closely.
> Considering how much energy good software engineers spend eschewing
> unnecessary complexity, do we really want to embrace the prospect of
> having different things look identical?  (A relevant reference point:
> mixtures of spaces and tabs in Python indentation.)

That kind of sums up my position.
To be a casual user of unicode is one thing
To support it is another -- unicode strings in python3 -- ok so far
To mix up these two is a third without enough thought or consideration --
unicode identifiers is likely a security hole waiting to happen...

No I am not clever/criminal enough to know how to write a text that is visually
close to 
print "Hello World"
but is internally closer to
rm -rf /

For me this:
 >>> Α = 1
>>> A = 2
>>> Α + 1 == A 
True
>>> 


is cure enough that I am not amused

[The only reason I brought up case distinction is that this is in the same 
direction and way worse than that]

If python had been more serious about embracing the brave new world of
unicode it should have looked in this direction:
http://blog.languager.org/2014/04/unicoded-python.html

Also here I suggest a classification of unicode, that, while not
official or even formalizable is (I believe) helpful
http://blog.languager.org/2015/03/whimsical-unicode.html

Specifically as far as I am concerned if python were to throw back say
a ligature in an identifier as a syntax error -- exactly what python2 does --
I think it would be perfectly fine and a more sane choice



More information about the Python-list mailing list