Unicode normalisation [was Re: [beginner] What's wrong?]

Fri Apr 8 13:51:17 EDT 2016

On Friday, April 8, 2016 at 10:24:17 AM UTC+5:30, Chris Angelico wrote:
> On Fri, Apr 8, 2016 at 2:43 PM, Rustom Mody  wrote:
> > No I am not clever/criminal enough to know how to write a text that is visually
> > close to
> > print "Hello World"
> > but is internally closer to
> > rm -rf /
> >
> > For me this:
> > >>> Α = 1
> > >>> A = 2
> > >>> Α + 1 == A
> > True
> > >>>
> >
> >
> > is cute enough that I am not amused
> 
> To me, the above is a contrived example. And you can contrive examples
> that are just as confusing while still being ASCII-only, like
> swimmer/swirnmer in many fonts, or I and l, or any number of other
> visually-confusing glyphs. I propose that we ban the letters 'r' and
> 'l' from identifiers, to ensure that people can't mess with
> themselves.

swirnmer and swimmer are distinguished by squinting a bit;
А and A only by digging down into the hex.
If you categorize them as similar/same... well I am not arguing...
will come to you when I am short of straw...
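
To make "digging down into the hex" concrete, here is a minimal sketch using
only the standard unicodedata module. The helper name describe() and the three
sample variables are my own, purely for illustration:

```python
import unicodedata

def describe(ch):
    """Return (hex code point, official Unicode name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))

latin = "\u0041"     # LATIN CAPITAL LETTER A
greek = "\u0391"     # GREEK CAPITAL LETTER ALPHA
cyrillic = "\u0410"  # CYRILLIC CAPITAL LETTER A

# The glyphs usually render identically; only the code points and names differ.
for ch in (latin, greek, cyrillic):
    print(ch, *describe(ch))

# NFKC normalisation (the form Python 3 applies to identifiers) does not
# merge these look-alikes: all three survive as distinct characters.
distinct = {unicodedata.normalize("NFKC", ch) for ch in (latin, greek, cyrillic)}
print(len(distinct))
```

So normalisation is no help here: the three letters stay three different
identifiers, which is exactly the situation in the Α/A session quoted above.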

> 
> > Specifically as far as I am concerned if python were to throw back say
> > a ligature in an identifier as a syntax error -- exactly what python2 does --
> > I think it would be perfectly fine and a more sane choice
> 
> The ligature is handled straight-forwardly: it gets decomposed into
> its component letters. I'm not seeing a problem here.

Yes... there is no problem... HERE [I did say Python gets this right where
Haskell, for example, gets it wrong].
What's wrong is the whole approach of swallowing gobs of characters that
need not be legal at all and then getting indigestion:

Note the "non-normative" in
https://docs.python.org/3/reference/lexical_analysis.html#identifiers

If a language reference is not normative, what is?
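
For anyone who wants to check the ligature behaviour discussed above, a small
sketch for Python 3 follows. The variable names ligature and namespace are
just illustrative choices of mine; the only library calls are
unicodedata.normalize() and the built-in exec():

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# NFKC decomposes the ligature into its two component letters.
print(unicodedata.normalize("NFKC", ligature) == "fi")

# An identifier spelled with the ligature is normalised while parsing,
# so the name that actually gets bound is the plain ASCII spelling.
namespace = {}
exec("\ufb01le = 1", namespace)
print("file" in namespace)       # the normalised name is present
print("\ufb01le" in namespace)   # the raw ligature spelling is not a key
```

Note that only identifiers in source code are normalised this way; the string
keys used to inspect the namespace are compared as-is.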