[Python-Dev] PEP 277 (unicode filenames): please review
Jack Jansen
Jack.Jansen@oratrix.com
Tue, 13 Aug 2002 23:14:30 +0200
On Tuesday, August 13, 2002, at 03:01, Guido van Rossum wrote:
>
> Looks like it isn't you: the filename somehow contains a character
> that's not in the Latin-1 subset of Unicode, and no encoding can fix
> that for you. I don't know why -- you'll have to figure out why your
> keyboard generates that character when you type o-umlaut.
No, it's the way the filesystem stores filenames, apparently.
Or, at least, it's the way the filesystem APIs expose those
filenames. Here's a session again (this time I'm using the
terminal in utf-8 mode):
>>> x = "fr\xc3\xb6r"
>>> os.listdir(".")
['.DS_Store']
>>> open(x, "w")
<open file 'frör', mode 'w' at 0x130838>
>>> os.listdir(".")
['.DS_Store', 'fro\xcc\x88r']
>>> os.path.exists('fro\xcc\x88r')
True
>>> os.path.exists("fr\xc3\xb6r")
True
If I create a file with an o-umlaut it gets decomposed into an o
and a combining umlaut.
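A minimal sketch of the two forms, using the standard unicodedata module. It assumes a current Python 3 rather than the interpreter used in the session above, and the variable names are only for illustration:

```python
import unicodedata

# o-umlaut as a single code point (U+00F6), the composed form
composed = "fr\u00f6r"

# NFD splits it into "o" followed by U+0308 COMBINING DIAERESIS
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))   # 4 5
print(composed.encode("utf-8"))         # b'fr\xc3\xb6r'
print(decomposed.encode("utf-8"))       # b'fro\xcc\x88r'

# Composing again gives back the original string
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

The two byte strings are the ones that show up in the session: the first is what I typed, the second is what os.listdir() hands back.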
[Jack goes off and wrestles his way through a gazillion websites
with Unicode information]
If I understand the unicode standard (according to unicode.org)
correctly this means that MacOS stores filenames in NFD
normalized form, with all combining characters split out, and
this is the preferred normalized form. Am I correct here?
But, even if NFC is the preferred normalized form (the documents
I saw hinted that this may have been the case in previous
Unicode standards :-): both NFC and NFD renditions of this string
are legal unicode, aren't they? And if they are then both should
be converted to the same latin-1 string, shouldn't they?
Do I misunderstand something, or is this a bug (limitation?)
in the unicode->latin-1 encoder?
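To make the question concrete, here is a small sketch, again assuming a current Python 3. The helper to_latin1 is hypothetical, not something the codec provides. Encoding the decomposed string directly fails because U+0308 has no Latin-1 code point, while composing to NFC first yields the same bytes for both renditions:

```python
import unicodedata

def to_latin1(name):
    """Hypothetical helper: compose to NFC, then encode as Latin-1."""
    return unicodedata.normalize("NFC", name).encode("latin-1")

nfc_name = "fr\u00f6r"     # composed o-umlaut
nfd_name = "fro\u0308r"    # o + combining diaeresis, as the filesystem returns it

# The codec works code point by code point, so the NFD form is rejected
try:
    nfd_name.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

print(direct_ok)             # False
print(to_latin1(nfc_name))   # b'fr\xf6r'
print(to_latin1(nfd_name))   # b'fr\xf6r'
```

Whether the codec itself ought to do this composition step is exactly what I am asking.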
--
- Jack Jansen <Jack.Jansen@oratrix.com>
http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution --
Emma Goldman -