Unicode normalisation [was Re: [beginner] What's wrong?]

Steven D'Aprano steve at pearwood.info
Fri Apr 8 02:00:10 EDT 2016


On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote:

> Seriously, it's cute how neatly normalisation works when you're
> watching closely and using it in the circumstances for which it was
> intended, but that hardly proves that these practices won't cause much
> trouble when they're used more casually and nobody's watching closely.
> Considering how much energy good software engineers spend eschewing
> unnecessary complexity, 

Maybe so, but it's not the good software engineers we have to worry about,
it's the other 99.9% :-)


> do we really want to embrace the prospect of 
> having different things look identical?

You mean like ASCII identifiers? I'm afraid it's about fifty years too late
to ban identifiers using O and 0, or l, I and 1, or rn and m.

Or for that matter:

a = akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg() + 100
b = 100 + akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg()

How easily can you tell them apart at a glance?
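
A tool has no such trouble, of course. As a rough sketch (the helper
first_difference is my own hypothetical name, not anything from the
standard library), a simple character-by-character scan finds the spot
where the two names diverge:

```python
# The two function names from the example above, copied verbatim.
name_a = "akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg"
name_b = "akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg"


def first_difference(left, right):
    """Return the index of the first differing character, or None."""
    for index, (x, y) in enumerate(zip(left, right)):
        if x != y:
            return index
    if len(left) != len(right):
        # One string is a strict prefix of the other.
        return min(len(left), len(right))
    return None


index = first_difference(name_a, name_b)
if index is not None:
    # Show a little context either side of the mismatch.
    print(index, name_a[index - 3:index + 4], name_b[index - 3:index + 4])
```

The point is only that the all-ASCII case needs the same kind of help
from tools that the Unicode case does.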

The reality is that we trust our coders not to deliberately mess us about.
As the Obfuscated C and the Underhanded C contests prove, you don't need
Unicode to hide hostile code. In fact, the use of Unicode confusables in an
otherwise all-ASCII file is a dead giveaway that something fishy is going
on.
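
To illustrate how cheap that giveaway is to detect, here is a minimal
sketch using the standard tokenize module. The function name
non_ascii_names is hypothetical, and str.isascii assumes Python 3.7 or
later. It is a starting point for a review script, not a complete
confusables detector:

```python
import io
import tokenize


def non_ascii_names(source):
    """Return (line number, name) for each identifier that is not pure ASCII."""
    found = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


# The second "total" contains U+043E CYRILLIC SMALL LETTER O.
sample = "total = 1\nt\u043etal = 2\n"
for lineno, name in non_ascii_names(sample):
    print(lineno, ascii(name))
```

In a code base that is meant to be all-ASCII, any hit at all is worth a
second look.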

I think that, beyond normalisation, the compiler need not be too concerned
by confusables. I wouldn't *object* to the compiler raising a warning if it
detected confusable identifiers, or mixed script identifiers, but I think
that's more a job for a linter or human code review.
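
For the curious, here is a sketch of both halves. Python normalises
identifiers to NFKC when parsing (PEP 3131), which unicodedata.normalize
can reproduce. The mixed-script check is my own crude heuristic, not
what any real linter does: it assumes the first word of a character's
Unicode name ("LATIN", "CYRILLIC", "GREEK", ...) is a good enough stand-in
for its script. The function names are hypothetical.

```python
import unicodedata


def normalise_identifier(name):
    """Apply the NFKC normalisation that Python uses for identifiers."""
    return unicodedata.normalize("NFKC", name)


def scripts_used(name):
    """Approximate the scripts in a name from each letter's Unicode name."""
    scripts = set()
    for ch in normalise_identifier(name):
        if ch.isalpha():
            # e.g. 'CYRILLIC SMALL LETTER A' -> 'CYRILLIC'
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(name):
    """True if the (normalised) name appears to mix more than one script."""
    return len(scripts_used(name)) > 1


# U+FB01 is the 'fi' ligature; NFKC expands it to plain 'f' + 'i'.
print(normalise_identifier("\ufb01le") == "file")
# U+0430 is CYRILLIC SMALL LETTER A hiding among Latin letters.
print(sorted(scripts_used("p\u0430ypal")))
```

A real tool would want the Unicode script property and the consortium's
confusables data rather than this name-based guess.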



> (A relevant reference point: 
> mixtures of spaces and tabs in Python indentation.)

Most editors have an option to display whitespace, and tabs and spaces look
different. Typically the tab is shown as an arrow and the space as a
dot. If people *still* confuse them, the issue is easily managed by a
combination of "well don't do that" and TabError.
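
A small demonstration of the TabError half, as a sketch for Python 3.
The helper name tab_error_message is hypothetical. The block below
indents one line with a tab and the next with eight spaces, which look
the same in many editors but which the compiler refuses to guess about:

```python
# Line 2 is indented with one tab, line 3 with eight spaces.
mixed = "if True:\n\tx = 1\n        y = 2\n"
# The same code indented consistently with four spaces.
clean = "if True:\n    x = 1\n    y = 2\n"


def tab_error_message(source):
    """Return the TabError message for source, or None if it compiles."""
    try:
        compile(source, "<example>", "exec")
    except TabError as exc:
        return exc.msg
    return None


print(tab_error_message(mixed))
print(tab_error_message(clean))
```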


> [snip]
>> The Unicode consortium seems to disagree with you.
> 
> <cranky_geezer_font>
> 
> The Unicode consortium was certifiably insane when it went into the
> typesetting business.

They are not, and never have been, in the typesetting business. Perhaps
characters are not the only things easily confused *wink*

(Although some members of the consortium may be. But the consortium itself
isn't.)


> The pile-of-poo character was just frosting on 
> the cake.

Blame the Japanese mobile phone companies for that. When you pay your
membership fee, you get to object to the addition of characters too.
(Anyone, I think, can propose a new character, but only members get to
choose which proposals are accepted.)

But really, why should we object? Is "pile-of-poo" any more silly than any
of the other dingbats, graphics characters, and other non-alphabetical
characters? Unicode is not just for "letters of the alphabet".
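
As far as the character database is concerned, it is just another
symbol. A quick look with the standard unicodedata module (the choice
of comparison characters is mine) shows it sitting in the same general
category, "So" (Symbol, other), as older dingbats:

```python
import unicodedata

# U+1F4A9 is the character under discussion; U+2708 is a classic dingbat.
poo = "\U0001F4A9"
airplane = "\u2708"

for ch in (poo, airplane, "A"):
    # Print code point, general category and official name.
    print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
```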


-- 
Steven



