Unicode normalisation [was Re: [beginner] What's wrong?]

Fri Apr 8 02:13:07 EDT 2016

On Fri, Apr 8, 2016 at 4:00 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> Or for that matter:
>
> a = akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg() + 100
> b = 100 + akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg()
>
> How easily can you tell them apart at a glance?

Ouch! Can't even align them top and bottom. This is evil.

> I think that, beyond normalisation, the compiler need not be too concerned
> by confusables. I wouldn't *object* to the compiler raising a warning if it
> detected confusable identifiers, or mixed script identifiers, but I think
> that's more the job for a linter or human code review.

The compiler should treat as identical anything that an editor should
reasonably treat as identical. I'm not sure whether multiple combining
characters on a single base character are forced into some order prior
to comparison or are kept in the order they were typed, but my gut
feeling is that they should be considered identical.
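
For what it's worth, here is a minimal sketch of how that can be checked
with the stdlib unicodedata module. The particular marks (dot above and
dot below on a "q") and the variable names are just my choice for
illustration. It assumes Python 3, which normalises identifiers to NFKC
while parsing (PEP 3131).

```python
import unicodedata

# Same base letter, same two combining marks, typed in opposite orders.
# U+0307 COMBINING DOT ABOVE (combining class 230)
# U+0323 COMBINING DOT BELOW (combining class 220)
typed_above_first = "q\u0307\u0323"
typed_below_first = "q\u0323\u0307"

# As raw code point sequences the two strings differ.
raw_equal = typed_above_first == typed_below_first

# Normalisation sorts marks with different combining classes into a
# canonical order, so both spellings end up as the same sequence.
nfkc_above_first = unicodedata.normalize("NFKC", typed_above_first)
nfkc_below_first = unicodedata.normalize("NFKC", typed_below_first)
normalised_equal = nfkc_above_first == nfkc_below_first

# Identifiers go through the same normalisation when source is parsed,
# so a name bound with one typing order is found under the other.
namespace = {}
exec(typed_above_first + " = 1", namespace)
same_name = typed_below_first in namespace

print(raw_equal, normalised_equal, same_name)
```

Marks that share a combining class (two accents that both sit above the
letter, say) are not reordered, because their order changes how they
stack, so those sequences stay distinct.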

> They are not, and never have been, in the typesetting business. Perhaps
> characters are not the only things easily confused *wink*

Peter is definitely a character. So are you. QUITE a character. :)

> But really, why should we object? Is "pile-of-poo" any more silly than any
> of the other dingbats, graphics characters, and other non-alphabetical
> characters? Unicode is not just for "letters of the alphabet".

It's less silly than "ZERO WIDTH NO-BREAK SPACE", which isn't a
space at all, it's a joiner. Go figure.
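
A quick way to see this from Python, as a sketch using only the stdlib
unicodedata module (the variable names are mine):

```python
import unicodedata

zwnbsp = "\ufeff"

# The formal character name, which Unicode guarantees will not change.
name = unicodedata.name(zwnbsp)

# General category: "Cf" is a format character; real spaces are "Zs".
category = unicodedata.category(zwnbsp)

# str.isspace() does not treat it as whitespace either.
is_space = zwnbsp.isspace()

# U+2060 was added later to take over the joining role.
successor = unicodedata.name("\u2060")

print(name, category, is_space, successor)
```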

(History's a wonderful thing, ain't it? So's backward compatibility
and a guarantee that names will never be changed.)

ChrisA