Unicode normalisation [was Re: [beginner] What's wrong?]

Fri Apr 8 14:17:53 EDT 2016

On Friday, April 8, 2016 at 11:33:38 PM UTC+5:30, Peter Pearson wrote:
> On Sat, 9 Apr 2016 03:50:16 +1000, Chris Angelico wrote:
> > On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa  wrote:
> [snip]
> >> (As for ligatures, I understand that there might be quite a bit of
> >> legacy software that dedicated code points and code pages for ligatures.
> >> Translating that legacy software to Unicode was made more
> >> straightforward by introducing analogous codepoints to Unicode. Unicode
> >> has quite many such codepoints: µ, K, Ω etc.)
> >
> > More specifically, Unicode solved the problems that *codepages* had
> > posed. And one of the principles of its design was that every
> > character in every legacy encoding had a direct representation as a
> > Unicode codepoint, allowing bidirectional transcoding for
> > compatibility. Perhaps if Unicode had existed from the dawn of
> > computing, we'd have less characters; but backward compatibility is
> > way too important to let a narrow purity argument sway it.
> 
> I guess with that historical perspective the current situation
> seems almost inevitable.  Thanks.  And thanks to Steven D'Aprano
> for other relevant insights.

Strange view.
In fact the Unicode standard itself encourages not using the standard in its
entirety:

5.12 Deprecation

In the Unicode Standard, the term deprecation is used somewhat differently than it is in some other standards. Deprecation is used to mean that a character or other feature is strongly discouraged from use. This should not, however, be taken as indicating that anything has been removed from the standard, nor that anything is planned for removal from the standard. Any such change is constrained by the Unicode Consortium Stability Policies [Stability].

For the Unicode Character Database, there are two important types of deprecation to be noted. First, an encoded character may be deprecated. Second, a character property may be deprecated.

When an encoded character is strongly discouraged from use, it is given the property value Deprecated=True. The Deprecated property is a binary property defined specifically to carry this information about Unicode characters. Very few characters are ever formally deprecated this way; it is not enough that a character be uncommon, obsolete, disliked, or not preferred. Only those few characters which have been determined by the UTC to have serious architectural defects or which have been determined to cause significant implementation problems are ever deprecated. Even in the most severe cases, such as the deprecated format control characters (U+206A..U+206F), an encoded character is never removed from the standard. Furthermore, although deprecated characters are strongly discouraged from use, and should be avoided in favor of other, more appropriate mechanisms, they may occur in data. Conformant implementations of Unicode processes such as Unicode normalization must handle even deprecated characters correctly.
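To make that last sentence concrete, here is a minimal sketch using Python's unicodedata module. It checks that the deprecated format controls survive all four normalisation forms rather than being dropped. The range is hard-coded from the text quoted above, since unicodedata does not expose the Deprecated property, and the helper name survives_normalization is made up for this illustration.

```python
import unicodedata

# U+206A..U+206F are the deprecated format controls named in the quoted
# section. The range is hard-coded here because the unicodedata module
# has no accessor for the Deprecated property.
DEPRECATED_CONTROLS = [chr(cp) for cp in range(0x206A, 0x2070)]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def survives_normalization(char, form):
    """Return True if char is still present after normalising text around it."""
    text = "a" + char + "b"
    return char in unicodedata.normalize(form, text)


for char in DEPRECATED_CONTROLS:
    kept = all(survives_normalization(char, form) for form in FORMS)
    print(f"U+{ord(char):04X} {unicodedata.name(char)}: kept={kept}")
```

So "deprecated" here means "please don't write new data with these", not "implementations may throw them away".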

I read this as saying that, in addition to the officially deprecated chars,
there ARE "uncommon, obsolete, disliked, or not preferred" chars
which sensible users should avoid using, even though Unicode as a standard is
compelled to keep supporting them.

Which translates into:
- Python as a language *implementing* Unicode (e.g. in strings) needs to
do it completely if it is to be standard-compliant
- Python as a *user* of Unicode (e.g. in identifiers) can (and IMHO should)
use better judgement
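
For what it's worth, Python 3 already draws roughly this line. A rough sketch follows, using the micro sign mentioned upthread; the variable names are only for illustration. String data keeps the legacy code point exactly as given, whereas identifiers are converted to NFKC while parsing (per PEP 3131), so the micro sign and Greek mu name the same variable.

```python
import unicodedata

MICRO_SIGN = "\u00b5"  # legacy compatibility character
GREEK_MU = "\u03bc"    # the ordinary Greek small letter mu

# As string data, the two code points stay distinct: nothing is normalised.
strings_differ = MICRO_SIGN != GREEK_MU

# NFKC maps the compatibility character onto the ordinary letter.
nfkc_folds = unicodedata.normalize("NFKC", MICRO_SIGN) == GREEK_MU

# As an identifier, the micro sign is normalised by the parser, so the
# name that ends up bound in the namespace is the Greek mu.
namespace = {}
exec(MICRO_SIGN + " = 1", namespace)
bound_names = [name for name in namespace if name != "__builtins__"]

print(strings_differ, nfkc_folds, [f"U+{ord(n):04X}" for n in bound_names])
```

Whether NFKC folding is "better judgement" enough, or identifiers should be restricted further, is the part that is still a matter of opinion.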