[Python-Dev] PEP 277 (unicode filenames): please review

M.-A. Lemburg mal@lemburg.com
Tue, 13 Aug 2002 23:02:01 +0200


Martin v. Loewis wrote:
> Guido van Rossum <guido@python.org> writes:
> 
> 
>>>It could be that Apple is decomposing the filenames before comparing
>>>them. Either way works.
>>
>>Hm, that sucks (either way) -- because you get unnormalized Unicode
>>out of directory listings, which is harder to turn into local
>>encodings.
> 
> 
> Notice that, most likely, Apple *does* normalize them - they just use
> Normal Form D (which favours decomposition, instead of using
> precomposed characters) - this is what Apple apparently calls
> "canonical".

Both the decomposition and the composition are called "canonical" --
simply because both operations lead to predefined results (those
defined by the Unicode database).

http://www.unicode.org/unicode/reports/tr15/

has all the details.

As always with Unicode, things are slightly more complicated than
what people are used to (but for good reasons). The introduction
of that tech report describes these things in detail. Canonical
equivalence basically means that the graphemes rendered for the
Unicode code points look the same to the user -- even though the
underlying code point sequences may differ.

Normalization turns this visual equivalence into something an
algorithm can check: canonically equivalent strings normalize
to the same code point sequence.
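
To make that concrete, here's a minimal sketch, assuming a Python
with unicodedata.normalize() (it landed in Python 2.3 and implements
exactly the four forms from TR #15):

    import unicodedata

    composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
    decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (two code points)

    # The raw code point sequences differ...
    print(composed == decomposed)                                # False
    # ...but both normal forms agree that the strings are equivalent:
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True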

Now, if the OS uses canonical equivalence to find file names,
then all code point combinations that result in the same sequence
of graphemes will give you a match, and for a good reason: the
user of a GUI file manager couldn't distinguish between two
canonically equivalent file names anyway.
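
As a sketch of the idea (find_equivalent() is a hypothetical
helper, not anything an OS actually exposes), such a lookup
succeeds no matter which equivalent spelling the caller passes in:

    import os
    import unicodedata

    def find_equivalent(dirname, wanted):
        # Match by canonical equivalence rather than by raw code
        # points: normalize both sides to the same form (NFD here,
        # mirroring Apple's choice) before comparing.
        wanted_nfd = unicodedata.normalize("NFD", wanted)
        for name in os.listdir(dirname):
            if unicodedata.normalize("NFD", name) == wanted_nfd:
                return name
        return None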

> That choice is not surprising - NFD is "more logical", as precomposed
> characters are available only arbitrarily (e.g. the WITH TILDE
> combinations exist for a, i, e, n, o, u, v, y, but not for, say, x).

... but in a well-defined manner, and that's what's important.

> The Unicode FAQ
> (http://www.unicode.org/unicode/faq/normalization.html) says
> 
> Q: Which forms of normalization should I support?
> 
> A: The choice of which to use depends on the particular program or
> system.  The most commonly supported form is NFC, since it is more
> compatible with strings converted from legacy encodings. This is also
> the choice for the web, as per the recommendations in "Character Model
> for the World Wide Web" from the W3C. The other normalization forms
> are useful for other domains.
> 
> So I guess Python should at least provide NFC - precisely because of
> the legacy encodings.

At least is good :-) NFC is NFD + canonical composition. Decomposition
isn't all that hard (using unicodedata.decomposition()). For
composition the situation is different: not all of the needed
information is available in the unicodedata database (the
composition exclusion list is missing), and the database also
doesn't provide the reverse mapping from decomposed code point
sequences to composed ones. See the Annexes to the tech report
to get an impression of just how hard combining is...
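
A rough sketch of that asymmetry (written for a modern Python 3;
at the time one would have written unichr() instead of chr()). The
naive reverse table built below is deliberately incomplete and is
*not* a correct composer:

    import unicodedata

    # Decomposing is easy -- the database hands it to you directly:
    unicodedata.decomposition("\u00e9")    # -> '0065 0301'

    # Composing is not: there is no inverse lookup, so you'd have
    # to build the reverse table yourself (skipping surrogates and
    # the compatibility mappings, which start with a "<tag>")...
    reverse = {}
    for cp in list(range(0xD800)) + list(range(0xE000, 0x10000)):
        decomp = unicodedata.decomposition(chr(cp))
        if decomp and not decomp.startswith("<"):
            reverse[decomp] = cp

    # ...and even then 'reverse' is not NFC-correct: the composition
    # exclusions (singletons, script-specific precomposed forms,
    # etc.) aren't exposed by the unicodedata module at all.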

Still, it would be nice to have (written in C for speed, since
this would be a very common operation). Zope Corp. will certainly
be interested in this for Zope3 ;-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/