Python Unicode handling wins again -- mostly

Sun Dec 1 11:57:44 EST 2013

On Sunday, 1 December 2013 00:07:36 UTC+1, Ned Batchelder wrote:
> On 11/30/13 5:37 PM, Gregory Ewing wrote:
> > wxjmfauth at gmail.com wrote:
> >> And do you know the origin of this typographical feature?
> >> Because, mechanically, the dot of the "i" broke too often.
> >>
> >> In my opinion, a very plausible explanation.
> >
> > It doesn't sound very plausible to me, because there
> > are a lot more stand-alone 'i's in English text than
> > there are ones following an f. What is there to stop
> > them from breaking?
> >
> > It's more likely to be simply a kerning issue. You
> > want to get the stems of the f and the i close together,
> > and the only practical way to do that with mechanical
> > type is to merge them into one piece of metal.
> >
> > Which makes it even sillier to have an 'ffi' character
> > in this day and age, when you can simply space the
> > characters so that they overlap.
> >
>
> The fi ligature was created because visually, an f and i wouldn't work
> well together: the crossbar of the f was near, but not connected to, the
> serif of the i, and the terminal bulb of the f was close to, but not
> coincident with, the dot of the i.
>
> This article goes into great detail, and has a good illustration of how
> an f and i can clash, and how an fi ligature can fix the problem:
> http://opentype.info/blog/2012/11/20/whats-a-ligature/ . Note the second
> fi illustration, which demonstrates using a ligature to make the letters
> appear *less* connected than they would individually!
>
> This is also why "simply spacing the characters" isn't a solution: a
> specially designed ligature looks better than a separate f and i, no
> matter how minutely kerned.
>
> It's unfortunate that Unicode includes presentation alternatives like
> the fi (and ff, fl, ffi, and ffl) ligatures.  It was done to be a
> superset of existing encodings.
>
> Many typefaces have other non-encoded ligatures as well, especially
> display faces, which also have alternate glyphs.  Unicode is a funny mix
> in that it includes some forms of alternates, but can't include all of
> them, so we have to put up with both an ad-hoc Unicode that includes
> presentational variants, and also some other way to specify variants
> because Unicode can't include all of them.

I'm speaking about the times when (some of) the "characters" were
not even made of metal, but of wood (see Garamond, Bodoni).

---------

Unicode is "only" collecting "characters" in the sense of "abstract
entities". What is supposed to be a "character" is one problem.
How a tool is supposed to handle these "characters" is a problem
too, but a different one.

"Unicode" is not a coding scheme, it is a "repertoire".

Some illustrative examples instead of explanations:

The ffl ligature is a "character" because it has always
existed.
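
A small sketch of what this looks like from Python, using only the
standard unicodedata module. The variable names are mine, chosen for
illustration. The ligature is one code point with its own name;
only the compatibility normalization forms (NFKC/NFKD) fold it back
into three letters, while the canonical form NFC leaves it alone.

```python
import unicodedata

ffl = "\ufb04"  # the ffl ligature as a single code point

# One "character" in the repertoire, with its own name.
print(len(ffl), unicodedata.name(ffl))

# Compatibility normalization replaces it with the three plain letters.
folded = unicodedata.normalize("NFKC", ffl)
print(len(folded), folded)

# Canonical normalization does not touch it.
unchanged = unicodedata.normalize("NFC", ffl)
print(unchanged == ffl)
```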

The & and œ are considered today as unique "characters".
Historically, they were ligatures.

The Fahrenheit, Kelvin and Celsius signs are considered
"characters", even though Fahrenheit and Kelvin are just "letters".

Text justification. Calculating the space between "words"
in "rendering units" makes sense. Using a specific "character"
like a thin space to force a predefined space makes sense too.

The various zeroes one may see, shaped like an uppercase O, an O with
a dot in the center or a slashed O, are all the same zero in
different stylistic variants => a single "character" in the Unicode
table.

... but this medieval "character" existing in two forms (I do not
remember which one) was finally registered as two "characters",
and not as a stylistic variant of a single "character".

There are no "characters" for the symbols of the chemical elements;
the Latin script is good enough.

The QPlainTextEdit widget from Qt does not know '\n'. It uses
only the paragraph separator and the line separator. To render
a paragraph separator, it uses yet another "character", the
pilcrow.

The µ "character" in the ISO-8859-1 coding scheme is a Greek
letter, but it must be used or perceived as an SI unit prefix.
Unicode category: Ll, Unicode name: MICRO SIGN.
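
The category and name can be checked from Python with the standard
unicodedata module. A minimal sketch; the variable names are
illustrative. It also shows that the micro sign and the Greek letter
mu are two distinct code points, which compatibility normalization
(NFKC) merges into one.

```python
import unicodedata

micro = "\u00b5"  # the micro sign, byte 0xB5 in ISO-8859-1

print(unicodedata.name(micro), unicodedata.category(micro))
print(micro.encode("iso-8859-1"))

# NFKC maps the micro sign onto the Greek letter, a different code point.
mu = unicodedata.normalize("NFKC", micro)
print(unicodedata.name(mu), hex(ord(mu)))
print(mu == micro)
```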

How to place an arrow (vector) on top of an ê, if one cannot
decompose it?

Related, there are dotless variants of i and j.

The STIX fonts, with their huge number of math symbols, not
yet in the Unicode repertoire but present in the PUA (Private Use Area).

etc.

Unicode is quite open. It's a good idea to leave that
openness to the developer. In short, if a coder decomposes
a "character" like "â" into an "a" plus a "^", it's up to
the developer to know what to do when reversing such a
string and to count this sequence as two real "characters".
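
To make this concrete, here is a rough sketch in Python using the
standard unicodedata module. The helper reverse_clusters is a
hypothetical name of mine, and it is a simplification: it only keeps
combining marks attached to the preceding base character, which is
not full grapheme cluster segmentation.

```python
import unicodedata

composed = "\u00e2"  # "â" as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "a" + combining circumflex

print(len(composed), len(decomposed))
print([unicodedata.name(c) for c in decomposed])

word = "p" + decomposed + "te"  # "pâte" in decomposed form

# Naive reversal: the circumflex now follows the "t" instead of the "a".
naive = word[::-1]


def reverse_clusters(text):
    """Reverse text while keeping combining marks on their base character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


careful = reverse_clusters(word)
print(naive == careful)
```

Whether the decomposed sequence counts as one "character" or two
stays the developer's decision; len() simply reports code points.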

jmf