Preserving unicode filename encoding

Julien Phalip jphalip at gmail.com
Sat Oct 20 16:43:16 EDT 2012


Hi,

I've noticed that the encoding of non-ASCII filenames can be inconsistent between platforms when using the built-in open() function to create files.

For example, on an Ubuntu 10.04.4 LTS box, the character u'ş' (u'\u015f') ends up stored as u'ş' (u's\u0327'). Note how the two strings look exactly the same but are made of different code points. The original character is a single code point (u'\u015f'), whereas the name saved on the file system is a sequence of two code points: the letter 's' followed by a combining cedilla (u's\u0327'). In Unicode terms, these are the composed (NFC) and decomposed (NFD) forms of the same character. (You can learn more about combining diacritics in [1].) On the Mac, however, the original encoding is always preserved.
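To make the difference concrete, here is a minimal sketch, assuming Python 3 and using only the standard unicodedata module. The variable names are mine and purely illustrative. It shows that the two spellings compare unequal, encode to different UTF-8 bytes, and convert into each other under Unicode normalization.

```python
import unicodedata

composed = "\u015f"      # LATIN SMALL LETTER S WITH CEDILLA, one code point
decomposed = "s\u0327"   # 's' followed by COMBINING CEDILLA, two code points

# The two strings render identically but are different sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Their UTF-8 encodings differ too, which is what a byte-oriented
# file system would actually see.
print(composed.encode("utf-8"))          # b'\xc5\x9f'
print(decomposed.encode("utf-8"))        # b's\xcc\xa7'

# Normalization converts one form into the other.
nfd = unicodedata.normalize("NFD", composed)    # decomposes to 's' + U+0327
nfc = unicodedata.normalize("NFC", decomposed)  # composes to U+015F
print(nfd == decomposed, nfc == composed)       # True True
```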

This issue was also discussed in a blog post by Ned Batchelder [2]. One suggested approach is to normalize the filename. However, this could result in loss of information: what if, for example, the original filename did contain combining diacritics and we wanted to preserve them?
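One way to sidestep the information loss, sketched below under the assumption of Python 3, is to normalize only when comparing names and to leave the names themselves untouched. The helpers same_name and find_entry are hypothetical names of my own, not part of any library; the listing argument stands in for whatever os.listdir() returns.

```python
import unicodedata

def same_name(a, b, form="NFC"):
    """Compare two filenames after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def find_entry(wanted, listing):
    """Return the entry of `listing` equivalent to `wanted`, exactly as
    it appears in the listing, or None if there is no match."""
    for entry in listing:
        if same_name(wanted, entry):
            return entry
    return None

# Normalizing a name in place discards the original spelling: once both
# forms have been mapped to NFC, there is no way to tell them apart.
original = "s\u0327.txt"                       # decomposed spelling
stored = unicodedata.normalize("NFC", original)
print(stored == original)                      # False

# Comparing under normalization keeps whatever spelling is on disk.
listing = ["notes.txt", "\u015f.txt"]          # e.g. from os.listdir(path)
match = find_entry(original, listing)
print(match == "\u015f.txt")                   # True
```

The entry returned by find_entry can then be passed to open() as-is, so the spelling on disk is never rewritten.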

Ideally, the original encoding would be preserved. Is that possible, or is that completely out of Python's control?

Thanks a lot,

Julien

[1] http://en.wikipedia.org/wiki/Combining_diacritic#Unicode_ranges
[2] http://nedbatchelder.com/blog/201106/filenames_with_accents.html



More information about the Python-list mailing list