Unicode normalisation [was Re: [beginner] What's wrong?]

Steven D'Aprano steve at pearwood.info
Fri Apr 8 14:44:18 EDT 2016


On Sat, 9 Apr 2016 03:21 am, Peter Pearson wrote:

> On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano <steve at pearwood.info>
> wrote:
>> On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote:
>>> 
>>> The Unicode consortium was certifiably insane when it went into the
>>> typesetting business.
>>
>> They are not, and never have been, in the typesetting business. Perhaps
>> characters are not the only things easily confused *wink*
> 
> Defining codepoints that deal with appearance but not with meaning is
> going into the typesetting business.  Examples: ligatures, and spaces of
> varying widths with specific typesetting properties like being
> non-breaking.

Both of which are covered by the requirement that Unicode is capable of
representing legacy encodings/code pages.

Examples: MacRoman contains fl and fi ligatures, and NBSP. 

Non-breaking space is not so much a typesetting property as a semantic
property; that is, it deals with *meaning* (exactly what you suggested it
doesn't deal with). It is a space that tells the renderer not to break the
line at that point.
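
For what it's worth, the round-tripping requirement is easy to see from
Python itself. A minimal sketch using nothing but the standard mac_roman
codec:

    # Round-trip the two MacRoman ligatures and NBSP through Unicode and back.
    text = (
        "\N{LATIN SMALL LIGATURE FI}"
        "\N{LATIN SMALL LIGATURE FL}"
        "\N{NO-BREAK SPACE}"
    )
    raw = text.encode("mac_roman")          # all three exist in the legacy code page
    print(raw)
    assert raw.decode("mac_roman") == text  # lossless round trip

If those characters weren't in Unicode, that round trip would be lossy.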

Ligatures are a good example -- the Unicode consortium have explicitly
refused to add other ligatures beyond the handful needed for backwards
compatibility because they maintain that it is a typesetting issue that is
best handled by the font. There's even a FAQ about that very issue, and I
quote:

"The existing ligatures exist basically for compatibility and round-tripping
with non-Unicode character sets. Their use is discouraged. No more will be
encoded in any circumstances."

http://www.unicode.org/faq/ligature_digraph.html#Lig2
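
That discouragement also shows up in normalisation: the ligatures carry
*compatibility* decompositions, so the NFKC/NFKD forms fold them back to
plain letters, while the canonical forms leave them alone. A quick sketch
with the standard unicodedata module:

    import unicodedata

    s = "\N{LATIN SMALL LIGATURE FI}le"             # U+FB01 followed by 'le'
    print(ascii(unicodedata.normalize("NFC", s)))   # '\ufb01le' -- canonical form keeps the ligature
    print(ascii(unicodedata.normalize("NFKC", s)))  # 'file'     -- compatibility form folds it to f + i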


Unicode currently contains on the order of one hundred and ten thousand
defined code points. I'm sure that if you went through the entire list,
with a sufficiently loose definition of "typesetting", you could find some
that exist only for presentation and aren't covered by the legacy encoding
clause. So what? One swallow does not make a spring. Unicode explicitly
rejects responsibility for typesetting. See their discussion on
presentation forms:

http://www.unicode.org/faq/ligature_digraph.html#PForms

But I will grant you that sometimes there's a grey area between presentation
and semantics, and the Unicode consortium has to make a decision one way or
another. Those decisions may not always be completely consistent, and may
be driven by political and/or popular demand.

E.g. the Consortium explicitly state that stylistic issues such as bold,
italic, superscript etc are up to the layout engine or markup, and
shouldn't be part of the Unicode character set. They insist that they only
show representative glyphs for code points, and that font designers and
vendors are free (within certain limits) to modify the presentation as
desired. Nevertheless, there are specialist characters with distinct
formatting, and variant selectors for specifying a specific glyph, and
emoji modifiers for specifying skin tone.
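
Both of those mechanisms are just ordinary code points, so you can poke at
them with the standard unicodedata module. A small sketch (older Pythons
may not know names for the emoji-era characters, hence the fallback):

    import unicodedata

    # U+263A WHITE SMILING FACE + U+FE0F VARIATION SELECTOR-16 requests
    # emoji-style presentation; U+1F44D THUMBS UP SIGN + U+1F3FD EMOJI
    # MODIFIER FITZPATRICK TYPE-4 requests a particular skin tone.
    for s in ("\u263A\uFE0F", "\U0001F44D\U0001F3FD"):
        for ch in s:
            print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "<no name in this Python>")))
        print()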

But when you get down to fundamentals, character sets and alphabets have
always blurred the line between presentation and meaning. W ("double-u")
was, once upon a time, UU and & (ampersand) started off as a ligature
of "et" (Latin for "and"). There are always going to be cases where
well-meaning people can agree to disagree on whether or not adding the
character to Unicode was justified or not.




-- 
Steven



