Unicode normalisation [was Re: [beginner] What's wrong?]

Marko Rauhamaa marko at pacujo.net
Fri Apr 8 13:44:10 EDT 2016


Peter Pearson <pkpearson at nowhere.invalid>:

> On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano <steve at pearwood.info> wrote:
>> They are not, and never have been, in the typesetting business.
>> Perhaps characters are not the only things easily confused *wink*
>
> Defining codepoints that deal with appearance but not with meaning is
> going into the typesetting business. Examples: ligatures, and spaces
> of varying widths with specific typesetting properties like being
> non-breaking.
>
> Typesetting done in MS Word using such Unicode codepoints will never
> be more than a goofy approximation to real typesetting (e.g., TeX),
> but it will cost a huge amount of everybody's time, with the current
> discussion of ligatures in variable names being just a straw in the
> wind. Getting all the world's writing systems into a single, coherent
> standard was an extraordinarily ambitious, monumental undertaking, and
> I'm baffled that the urge to broaden its scope in this irrelevant
> direction was entertained at all.

I agree completely, but at the same time I have a lot of understanding
for the reasons why Unicode had to become such a mess. Part of it is
historical, part of it is political, and part of it lies in the
unavoidable messiness of trying to define what a character is.

For example, is "ä" one character or two: "a" plus "¨"? Is "i" one
character or two: "ı" plus "˙"? Is writing linear or two-dimensional?
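
To make the "ä" question concrete, here is a minimal Python sketch using
the standard unicodedata module. It assumes the two-codepoint spelling
uses U+0308 COMBINING DIAERESIS, which is what canonical decomposition
produces, rather than the spacing "¨" (U+00A8) written above. The
variable names are only for illustration.

```python
import unicodedata

composed = "\u00e4"      # "ä" as a single code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

# The two spellings render alike but are different strings.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Canonical normalisation maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# Plain "i" has no decomposition: Unicode treats its dot as part of the letter.
print(unicodedata.normalize("NFD", "i") == "i")               # True
```

So Unicode answers "both" for "ä" and "one" for "i", which is roughly
the messiness I mean.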

Unicode heroically and definitively solved the problems ASCII had posed
but introduced a bag of new, trickier problems.

(As for ligatures, I understand that there might be quite a bit of
legacy software that dedicated code points and code pages for ligatures.
Translating that legacy software to Unicode was made more
straightforward by introducing analogous codepoints to Unicode. Unicode
has quite a few such codepoints: µ, K, Ω, etc.)
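
A small sketch of how such legacy codepoints behave under normalisation,
again with unicodedata. I am assuming the symbols meant above are MICRO
SIGN (U+00B5), KELVIN SIGN (U+212A) and OHM SIGN (U+2126), and I have
added the "fi" ligature (U+FB01) as one more example; the helper name
show() is made up for this post.

```python
import unicodedata

def show(ch):
    """Print a code point with its NFC and NFKC normal forms."""
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: NFC={nfc!r} NFKC={nfkc!r}")

for ch in ("\u00b5", "\u212a", "\u2126", "\ufb01"):
    show(ch)

# KELVIN SIGN and OHM SIGN are canonically equivalent to the ordinary
# letters, so even NFC replaces them.
assert unicodedata.normalize("NFC", "\u212a") == "K"
assert unicodedata.normalize("NFC", "\u2126") == "\u03a9"   # GREEK CAPITAL LETTER OMEGA

# MICRO SIGN and the "fi" ligature are only compatibility equivalents:
# NFC leaves them alone, NFKC folds them.
assert unicodedata.normalize("NFC", "\u00b5") == "\u00b5"
assert unicodedata.normalize("NFKC", "\u00b5") == "\u03bc"  # GREEK SMALL LETTER MU
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```

As far as I know, Python 3 applies NFKC to identifiers, which is why a
ligature in a variable name ends up naming the same variable as the
spelled-out letters.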


Marko


