[Python-Dev] PEP 277 (unicode filenames): please review

Jack Jansen Jack.Jansen@oratrix.com
Tue, 13 Aug 2002 23:14:30 +0200


On Tuesday, August 13, 2002, at 03:01, Guido van Rossum wrote:
>
> Looks like it isn't you: the filename somehow contains a character
> that's not in the Latin-1 subset of Unicode, and no encoding can fix
> that for you.  I don't know why -- you'll have to figure out why your
> keyboard generates that character when you type o-umlaut.

No, it's the way the filesystem stores filenames, apparently.
Or, at least, it's the way the filesystem APIs expose those
filenames. Here's a session again (this time I'm using the
terminal in utf-8 mode):

 >>> x = "fr\xc3\xb6r"
 >>> os.listdir(".")
['.DS_Store']
 >>> open(x, "w")
<open file 'frör', mode 'w' at 0x130838>
 >>> os.listdir(".")
['.DS_Store', 'fro\xcc\x88r']
 >>> os.path.exists('fro\xcc\x88r')
True
 >>> os.path.exists("fr\xc3\xb6r")
True

If I create a file with an o-umlaut it gets decomposed into an o
and a combining umlaut.
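
To make the decomposition visible, here is a minimal sketch that decodes the two byte strings from the session above as UTF-8 and lists their code points. It assumes a current Python 3 interpreter rather than the Python used in the session, and the variable names `created` and `listed` are only illustrative.

```python
import unicodedata

created = b"fr\xc3\xb6r"   # the name passed to open() in the session
listed = b"fro\xcc\x88r"   # the name os.listdir() returned afterwards

for raw in (created, listed):
    text = raw.decode("utf-8")
    # Show every code point with its official Unicode character name.
    print([f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text])
```

The first name holds a single precomposed o-umlaut (U+00F6); the second holds a plain o followed by a combining diaeresis (U+0308), so the two strings differ in length even though they render the same.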

[Jack goes off and wrestles his way through a gazillion websites
with Unicode information]

If I understand the Unicode standard (according to unicode.org)
correctly, this means that MacOS stores filenames in NFD
normalized form, with all combining characters split out, and
this is the preferred normalized form. Am I correct here?
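
As a quick check of the NFD reading, the sketch below uses `unicodedata.normalize`, which is assumed to be available (it is in current Python 3; it was not part of the session above). It only shows that the two spellings are the NFC and NFD forms of the same string, not what the filesystem does internally.

```python
import unicodedata

composed = "fr\u00f6r"     # precomposed o-umlaut, as typed
decomposed = "fro\u0308r"  # o + combining diaeresis, as listed

# NFD splits the precomposed character; NFC puts it back together.
print(unicodedata.normalize("NFD", composed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```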

But, even if NFC is the preferred normalized form (the documents
I saw hinted that this may have been the case in previous
Unicode standards:-): both NFC and NFD renditions of this string
are legal Unicode, aren't they? And if they are then both should
be converted to the same latin-1 string, shouldn't they?

Do I misunderstand something, or is this a bug (limitation?)
in the unicode->latin-1 decoder?
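
For illustration, one possible workaround is to compose the name to NFC before encoding it. This is a sketch under assumptions: it targets current Python 3, `to_latin1` is a hypothetical helper and not an existing API, and it says nothing about how the codec itself ought to behave.

```python
import unicodedata

def to_latin1(name):
    """Hypothetical helper: compose to NFC, then encode as Latin-1."""
    return unicodedata.normalize("NFC", name).encode("latin-1")

decomposed = "fro\u0308r"  # NFD spelling, as returned by os.listdir()
composed = "fr\u00f6r"     # NFC spelling, as typed

try:
    # U+0308 has no Latin-1 equivalent, so encoding the NFD form directly fails.
    decomposed.encode("latin-1")
except UnicodeEncodeError as exc:
    print("direct encode fails:", exc.reason)

# After composing, both spellings give the same Latin-1 bytes.
print(to_latin1(decomposed) == to_latin1(composed))
```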
--
- Jack Jansen        <Jack.Jansen@oratrix.com>
http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution --
Emma Goldman -