[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?

Jim Jewett jimjjewett at gmail.com
Tue Jun 5 04:37:31 CEST 2007


Ligatures, such as Ĳ and ĳ (Unicode U+0132, U+0133), are considered
acceptable identifier characters unless explicitly tailored out.
(They appear in both ID and XID.)

Do we really want this, or should the ligature ĳ (U+0133) and the
two-letter sequence ij be treated as equivalent?  If so, then we need
to enforce this somehow.

To me, this suggests that we should use the NFKD form.  Examples at
http://www.unicode.org/reports/tr15/tr15-28.html show that only the
Kompatibility forms (NFKD and NFKC) split ﬁ (ligature U+FB01) into
the constituents f and i.  A Kompatibility form is also needed to
merge characters that are "the same" except for some presentational
quirk, such as being superscripted or half-width.
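
The tr15 examples can be checked with Python's unicodedata module.
This is only a minimal sketch: the helper name normalized_forms is
made up for illustration, and the results depend on the Unicode
database shipped with the interpreter.

```python
import unicodedata

# The four normalization forms defined by UAX #15.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text):
    """Map each normalization form name to the normalized text.

    Hypothetical helper for illustration; not part of any PEP.
    """
    return {form: unicodedata.normalize(form, text) for form in FORMS}


if __name__ == "__main__":
    # U+0132 / U+0133 are the IJ ligatures, U+FB01 is the fi ligature.
    for sample in ("\u0132", "\u0133", "\ufb01"):
        forms = normalized_forms(sample)
        # ascii() keeps the output readable on terminals without the glyphs.
        shown = {name: ascii(value) for name, value in forms.items()}
        print("U+%04X" % ord(sample), shown)
```

For these three characters, the canonical forms (NFC, NFD) should
return the input unchanged, while the compatibility forms (NFKC, NFKD)
should return the plain letter pairs.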

The PEP assumes NFC, but I haven't really understood why, unless that
is required for compatibility with other systems (in which case, it
should be made explicit).

-jJ

