[Python-3000] Filename: unicode normalization

Victor Stinner victor.stinner at haypocalc.com
Wed Oct 1 01:11:10 CEST 2008


Since it's hard to follow the filename thread across two mailing lists, I'm 
starting a new thread, on python-3000 only, about Unicode normalization of 
filenames.

Bad news: it looks like Linux doesn't normalize filenames. So if you used NFC 
to create a file, you have to use NFC again to open it (and the same goes for 
NFD).

Python 2 example that creates files in the four normalization forms:
>>> name=u'xäx'
>>> from unicodedata import normalize
>>> open(u'NFD-' + normalize('NFD', name), 'w').close()
>>> open(u'NFC-' + normalize('NFC', name), 'w').close()
>>> open(u'NFKC-' + normalize('NFKC', name), 'w').close()
>>> open(u'NFKD-' + normalize('NFKD', name), 'w').close()
>>> import os
>>> os.listdir('.')
['NFD-xa\xcc\x88x', 'NFC-x\xc3\xa4x', 'NFKC-x\xc3\xa4x', 'NFKD-xa\xcc\x88x']
>>> os.listdir(u'.')
[u'NFD-xa\u0308x', u'NFC-x\xe4x', u'NFKC-x\xe4x', u'NFKD-xa\u0308x']

Directory listing using Python 3:
>>> import os
>>> [name.encode('utf-8') for name in os.listdir('.')]
[b'NFD-xa\xcc\x88x', b'NFC-x\xc3\xa4x', b'NFKC-x\xc3\xa4x', 
b'NFKD-xa\xcc\x88x']
>>> os.listdir('.')
['NFD-xäx', 'NFC-xäx', 'NFKC-xäx', 'NFKD-xäx']

Same results, which is correct. Now try to open the files:
>>> from unicodedata import normalize
>>> open(normalize('NFC', 'NFC-xäx')).close()
>>> open(normalize('NFD', 'NFC-xäx')).close()
IOError: [Errno 2] No such file or directory: 'NFC-xäx'
>>> open(normalize('NFD', 'NFD-xäx')).close()
>>> open(normalize('NFC', 'NFD-xäx')).close()
IOError: [Errno 2] No such file or directory: 'NFD-xäx'

If the user chooses a result from os.listdir(): no problem (if he has good 
eyes and he's able to find the difference between 'xäx' (NFD) and 'xäx' 
(NFC) :-D).
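
Since both forms render identically, good eyes are not really enough. A small 
sketch (Python 3) of how to tell them apart programmatically: ascii() shows 
the escaped code points, and comparing a name with its own normalized copy 
tells you which forms it already is in. The helper name matching_forms() is 
made up for this example.

```python
from unicodedata import normalize

def matching_forms(name):
    """Return the normalization forms that leave `name` unchanged."""
    forms = ('NFC', 'NFD', 'NFKC', 'NFKD')
    return [form for form in forms if normalize(form, name) == name]

# Escapes are used so the two forms cannot be confused in this source text.
nfc_name = 'NFC-x\u00e4x'   # precomposed U+00E4
nfd_name = 'NFD-xa\u0308x'  # 'a' followed by combining diaeresis U+0308

for name in (nfc_name, nfd_name):
    # ascii() makes the otherwise invisible difference visible
    print(ascii(name), matching_forms(name))
```

The same check can be applied to each entry returned by os.listdir().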

If the user types the filename on the keyboard (on the command line or in a 
GUI dialog), you have to hope that the input uses the same normalization form 
as the one the filename was created with...
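
One possible workaround, as a rough sketch only (Python 3): instead of passing 
the typed name straight to open(), list the directory and compare names after 
normalizing both sides to the same form. The helper 
find_by_normalized_name() and its parameters are hypothetical, not an existing 
API, and this assumes the directory is small enough to list.

```python
import os
from unicodedata import normalize

def find_by_normalized_name(directory, typed_name, form='NFC'):
    """Return the on-disk names in `directory` that are equal to
    `typed_name` once both are normalized to `form`."""
    wanted = normalize(form, typed_name)
    return [entry for entry in os.listdir(directory)
            if normalize(form, entry) == wanted]
```

The result is a list because, on a filesystem that doesn't normalize, the NFC 
and NFD variants of one name can exist side by side; the caller has to decide 
what to do with zero or several matches. The name returned is the one actually 
stored on disk, so it can be passed to open() as-is.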

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/

