Unicode normalisation [was Re: [beginner] What's wrong?]

Steven D'Aprano steve at pearwood.info
Wed Apr 6 21:37:50 EDT 2016


On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:

> Rustom Mody wrote:

>> So here are some examples to illustrate what I am saying:
>> 
>> Example 1 -- Ligatures:
>> 
>> Python3 gets it right
>>>>> ﬂag = 1
>>>>> flag
>> 1

Python identifiers are intentionally normalised to reduce security issues,
or at least confusion and annoyance, due to visually-identical identifiers
being treated as different.

Unicode has technical standards dealing with identifiers:

http://www.unicode.org/reports/tr31/

and visual spoofing and confusables:

http://www.unicode.org/reports/tr39/

I don't believe that CPython goes to the full extreme of checking for mixed
script confusables, but it does partially mitigate the problem by
normalising identifiers.
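To see that normalisation in action outside the parser, here is a sketch using the stdlib unicodedata module (the ligature example is the one discussed later in this thread):

```python
import unicodedata

# U+FB02 is the single code point LATIN SMALL LIGATURE FL.
ligature_name = "\ufb02ag"   # looks like "flag", but starts with one ligature code point

# NFKC normalisation, which PEP 3131 applies to identifiers while
# parsing, folds the ligature into plain ASCII 'f' + 'l'.
normalised = unicodedata.normalize("NFKC", ligature_name)
print(normalised == "flag")        # the two spellings collapse to one name
print(ligature_name == "flag")     # but the raw strings are distinct
```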

Unfortunately PEP 3131 leaves a number of questions open. Presumably they
were answered in the implementation, but they aren't documented in the PEP.

https://www.python.org/dev/peps/pep-3131/




> Fascinating; confirmed with
> 
> | $ python3
> | Python 3.4.4 (default, Jan  5 2016, 15:35:18)
> | [GCC 5.3.1 20160101] on linux
> | […]
> 
> I do not think this is correct, though.  Different Unicode code sequences,
> after normalization, should result in different symbols.

I think you are confused about normalisation. By definition, normalising
different Unicode code sequences may result in the same symbols, since that
is what normalisation means.


Consider two distinct strings which nevertheless look identical:

py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
py> a == b
False
py> print(a, b)
ü ü


The purpose of normalisation is to turn one into the other:

py> import unicodedata
py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
True
py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
True


In the case of the ﬂ ligature, normalisation splits the ligature into
individual 'f' and 'l' code points regardless of whether you compose or
decompose:

py> unicodedata.normalize('NFKC', "ﬂag") == "flag"
True
py> unicodedata.normalize('NFKD', "ﬂag") == "flag"
True


That's using the compatibility composition forms (NFKC/NFKD). Using the
default canonical forms (NFC/NFD) leaves the ligature unchanged.
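That difference is easy to check directly (a sketch; NFC/NFD are the canonical forms, NFKC/NFKD the compatibility forms):

```python
import unicodedata

lig = "\ufb02"  # LATIN SMALL LIGATURE FL

# The canonical forms leave compatibility characters alone...
print(unicodedata.normalize("NFC", lig) == lig)   # True
print(unicodedata.normalize("NFD", lig) == lig)   # True

# ...while the compatibility forms fold the ligature to plain 'fl'.
print(unicodedata.normalize("NFKC", lig))         # fl
```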

Note that UTS #39 (security mechanisms) suggests that identifiers should be
normalised using NFKC.

[...]
> I think Haskell gets it right here, while Py3k does not.  The “ﬂ” is not
> to be decomposed to “fl”.

The Unicode consortium seems to disagree with you. Table 1 of UTS #39 (see
link above) includes "Characters that cannot occur in strings normalized to
NFKC" in the Restricted category, that is, characters which should not be
used in identifiers. ﬂ (U+FB02) cannot occur in such normalised strings, and
so it is classified as Restricted and should not be used in identifiers.
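One way to apply that rule with the stdlib alone (a sketch: UTS #39 restricts many other characters too, so this checks only the NFKC-stability part of the profile):

```python
import unicodedata

def nfkc_stable(identifier: str) -> bool:
    """True if the identifier is unchanged by NFKC normalisation."""
    return unicodedata.normalize("NFKC", identifier) == identifier

print(nfkc_stable("flag"))       # True: plain ASCII survives NFKC
print(nfkc_stable("\ufb02ag"))   # False: contains U+FB02, so Restricted
```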


I'm not entirely sure just how closely Python's identifiers follow the
standard, but I think that the intention is to follow something close to
"UAX31-R4. Equivalent Normalized Identifiers":

http://www.unicode.org/reports/tr31/#R4
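CPython's behaviour here can be observed directly: since PEP 3131 says identifiers are converted to NFKC while parsing, two source spellings that normalise alike refer to the same name (a sketch):

```python
# Assign through the ligature spelling, then look up the ASCII spelling.
namespace = {}
exec("\ufb02ag = 1", namespace)

print("flag" in namespace)     # True: the name is stored in NFKC form
print(namespace["flag"])       # 1
```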


[Rustom] 
>> Python gets it wrong
>>>>> a=1
>>>>> A
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'A' is not defined
> 
> This is not wrong; it is just different.

I agree with Thomas here. Case-insensitivity is a choice, and I don't think
it is a good choice for programming identifiers. Being able to make case
distinctions between (let's say):

SPAM  # a constant, or at least constant-by-convention
Spam  # a class or type
spam  # an instance


is useful.
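Because Python is case-sensitive, all three spellings can coexist as distinct names (an illustration only; the names are of course made up):

```python
SPAM = 42            # a constant-by-convention

class Spam:          # a class
    pass

spam = Spam()        # an instance

# Three distinct identifiers, distinguished only by case.
print(len({"SPAM", "Spam", "spam"}))  # 3
```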


[Rustom]
>> With ASCII the problems are minor: Case-distinct identifiers are distinct
>> -- they don't IDENTIFY.
> 
> I do not think this is a problem.
> 
>> This contradicts standard English usage and practice
> 
> No, it does not.


I agree with Thomas here too. Although it is rare for case to make a
distinction in English, it does happen. As the old joke goes:

Capitalisation is the difference between helping my Uncle Jack off a horse,
and helping my uncle jack off a horse.


So even in English, capitalisation can make a semantic difference.




-- 
Steven



