[Python-Dev] PEP 383 (again)

Tue Apr 28 14:41:19 CEST 2009

Lino Mastrodomenico wrote:
> Let's suppose that I use Python 2.x or something else to create a file
> with name b'\xff'. My (Linux) system has a sane configuration and the
> filesystem encoding is UTF-8, so it's an invalid name but the kernel
> will blindly accept it anyway.
> 
> With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'.

One question that really bothers me about this proposal is the following:

Assume a UTF-8 locale.  A file named b'\xff', being an invalid UTF-8
sequence, will be converted to the half-surrogate '\udcff'.  However, a
file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be
converted to '\udcff'.  Those are two quite different POSIX pathnames; how
will Python know which one I mean when I later pass '\udcff' to open()?

Another poster hinted at this question, but I haven't seen it answered yet.
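
The collision described above can be checked directly. What follows is only
a sketch, assuming a Python 3 interpreter whose codecs provide the
'surrogateescape' error handler (Python 3.1 and later). The helper names
decode_name() and encode_name() are made up for illustration and are not
part of the PEP; the codec is passed explicitly so the result does not
depend on the locale.

```python
# Hypothetical helpers; the names are illustrative, not part of PEP 383.
def decode_name(raw: bytes) -> str:
    """Decode a POSIX file name as UTF-8, escaping undecodable bytes."""
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name: str) -> bytes:
    """Turn an escaped string back into bytes for the operating system."""
    return name.encode("utf-8", "surrogateescape")


invalid = b"\xff"          # not valid UTF-8
encoded = b"\xed\xb3\xbf"  # UTF-8-style byte sequence for code point U+DCFF

for raw in (invalid, encoded):
    text = decode_name(raw)
    # ascii() shows lone surrogates as escapes instead of failing on output.
    print(ascii(raw), "->", ascii(text), "->", ascii(encode_name(text)))
```

If the two names decode to different strings and each one encodes back to
its original bytes, the ambiguity does not arise.  Under a decoder that
accepts b'\xed\xb3\xbf' as a valid sequence, as Python 2.5's does, both
names would come out as '\udcff' and the round trip could not recover both.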

[1]
I'm assuming that it's valid UTF-8 because Python 2.5 accepts
'\xed\xb3\xbf'.decode('utf-8') without error.  I don't claim to be a
UTF-8 expert.