[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?

Wed Jun 6 06:19:33 CEST 2007

Jim Jewett writes:

 > On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
 > 
 > > It seems to me that what UAX#31 is saying is "Distinguishing (or not)
 > > between 0035 DIGIT FIVE and 2075 SUPERSCRIPT FIVE should be
 > > equivalent to distinguishing (or not) between LATIN CAPITAL
 > > LETTER A and LATIN SMALL LETTER A."  I don't know that
 > > I agree (or disagree) in principle.
 > 
 > So effectively, they consider "a" and "A" to be presentational variants.

Well, no, the UAX#31 authors are pretty explicit that case distinctions
have semantic content, as do superscripts.  This is different from the
Arabic initial, medial, and final forms, ligatures, the Croatian
digraphs, and the Japanese double-byte ASCII, where there is no
semantic content (not even word division for Arabic AFAIK); use is
either required by "the rules" (for Arabic) or 100% at the discretion
of the user (ASCII variants).
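
To make the distinction concrete, here is a minimal sketch using Python's
standard unicodedata module.  The sample characters and the names SAMPLES
and nfkc_fold are my own choices for illustration, not anything taken from
UAX#31.  It shows what NFKC does to one character from each class
mentioned above, and that NFC leaves all of them alone.

```python
import unicodedata

# Illustrative samples only; one character per class discussed above.
SAMPLES = {
    "superscript": "\u2075",        # SUPERSCRIPT FIVE
    "ligature": "\ufb01",           # LATIN SMALL LIGATURE FI
    "croatian digraph": "\u01c6",   # LATIN SMALL LETTER DZ WITH CARON
    "fullwidth ascii": "\uff21",    # FULLWIDTH LATIN CAPITAL LETTER A
    "arabic final form": "\ufe8e",  # ARABIC LETTER ALEF FINAL FORM
}


def nfkc_fold(text):
    """Return text in normalization form NFKC (compatibility folding)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        nfc = unicodedata.normalize("NFC", ch)
        print(label, unicodedata.name(ch),
              "NFC unchanged:", nfc == ch,
              "NFKC:", [hex(ord(c)) for c in nfkc_fold(ch)])
```

All five have only compatibility decompositions, which is why NFC keeps
them distinct and NFKC does not.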

 > In some languages, certain presentational variants are used depending
 > on word position.  I think the ID_START property does exclude letters
 > that cannot appear in an initial position, but putting a final
 > character in the middle or vice versa would still be wrong.

Good point.  I'm going to interview some Arabic speakers who I believe
have some programming skills; I'll add that to the list.

 > If identifiers are built up in the equivalent of
 > 
 >     handler="do_" + name

I think this is pretty likely, and one of the attractions of languages
like Python.
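
A rough sketch of why this matters for runtime-built names, assuming a
dispatcher of my own invention (the Handlers class and dispatch function
are hypothetical).  getattr() compares strings exactly, so a name that
arrives with a ligature will not match a method spelled in plain ASCII
unless the code that builds the name normalizes it first.

```python
import unicodedata


class Handlers:
    """Hypothetical handler table, looked up by constructed name."""

    def do_file(self):
        return "handled file"


def dispatch(obj, name):
    # Build the identifier the way the quoted snippet does, then
    # normalize it so compatibility variants map to the plain spelling.
    handler_name = unicodedata.normalize("NFKC", "do_" + name)
    handler = getattr(obj, handler_name, None)
    if handler is None:
        raise AttributeError("no handler named %r" % handler_name)
    return handler()


if __name__ == "__main__":
    ligature_name = "\ufb01le"  # "file" spelled with LATIN SMALL LIGATURE FI
    print(hasattr(Handlers(), "do_" + ligature_name))  # raw lookup
    print(dispatch(Handlers(), ligature_name))         # normalized lookup
```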

 > The folding rules do say that it is OK (even good) to exclude certain
 > characters from certain foldings; I think we could preserve case
 > (including title-case?) as the only presentational variant we
 > recognize.

AFAICS from looking at the V2 table, case is an *analogy* used by
UAX#31 to clarify when NFKC is useful.  NFKC itself does not fold
case; it is considered appropriate if you have a language that folds
case anyway.
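
A quick check of that claim, as a sketch with helper names (nfkc,
nfkc_casefold) that I made up for the purpose.  Case folding has to be
applied as a separate step if a language wants it.

```python
import unicodedata


def nfkc(text):
    """Compatibility normalization only; case is preserved."""
    return unicodedata.normalize("NFKC", text)


def nfkc_casefold(text):
    """NFKC followed by case folding, renormalized afterwards."""
    return nfkc(nfkc(text).casefold())


if __name__ == "__main__":
    fullwidth_a = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A
    print(nfkc("A") == nfkc("a"))           # case survives NFKC
    print(nfkc(fullwidth_a) == "A")         # width does not
    print(nfkc_casefold(fullwidth_a) == "a")
```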

 > http://www.unicode.org/versions/corrigendum3.html suggests that many
 > of the Hangul are either pronunciation guide variants or even exact
 > duplicates (that were presumably missed when the canonicalization was
 > frozen?)

I'll have to ask some Koreans what they would use.

 > """It is recommended that all Arabic presentation forms be excluded
 > from identifiers in any event, although only a few of them must be
 > excluded for normalization to guarantee identifier closure."""

Cool.  I'll ask that, too.

 > Depends on what you mean by technical symbols.

Eg, the letterlike symbols (DEGREE CELSIUS), the number forms (ROMAN
NUMERAL ONE), and the APL set (2336--237A) in the BMP.  [[ I really
need to put together some tools to access that database from
XEmacs.... ]]
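
Lacking XEmacs tools, something like the following Python sketch would
do for poking at the database.  It relies only on the standard
unicodedata module; the function names describe and describe_range are
hypothetical.

```python
import unicodedata


def describe(ch):
    """Collect a few database fields for a single character."""
    return {
        "codepoint": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


def describe_range(first, last):
    """Describe every code point from first to last, inclusive."""
    return [describe(chr(cp)) for cp in range(first, last + 1)]


if __name__ == "__main__":
    for ch in ("\u2103", "\u2160"):  # the two named examples above
        print(describe(ch))
    for entry in describe_range(0x2336, 0x237A):  # the APL range above
        print(entry["codepoint"], entry["category"], entry["name"])
```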

 > IMHO, many of them are in fact listed as ID characters.  The math
 > versions (generally 1D400 - 1DC7B) are included.  But
 > http://unicode.org/reports/tr39/data/xidmodifications.txt suggests
 > excluding them again.

I'm not really worried about people using characters outside the BMP
very often, any more than people use an embedded comma in LISP
identifiers or file names (eg RCS ,v), unless they use a script lately
admitted to Unicode, or if they just wish to tempt the wrath of the
gods.  The former will not have a problem, and the latter can look out
for themselves, I'm sure.
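
For anyone who does want to look out for them, a minimal sketch of a
check for characters outside the BMP.  The helper name non_bmp_chars and
the sample identifier are my own; the sample uses U+1D400 from the math
alphanumerics mentioned in the quote.

```python
import unicodedata


def non_bmp_chars(identifier):
    """Return the characters of identifier that lie outside the BMP."""
    return [ch for ch in identifier if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    name = "\U0001d400rea"  # MATHEMATICAL BOLD CAPITAL A, then "rea"
    print(["U+%X" % ord(ch) for ch in non_bmp_chars(name)])
    # NFKC maps the math letter back to its plain counterpart.
    print(unicodedata.normalize("NFKC", name))
```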