[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 15:01:32 CEST 2009

2009/4/28 Glenn Linderman <v+python at g.nevcal.com>:
> The switch from PUA to half-surrogates does not resolve the issues with the
> encoding not being a 1-to-1 mapping, though.  The very fact that you think
> you can get away with use of lone surrogates means that other people might,
> accidentally or intentionally, also use lone surrogates for some other
> purpose.  Even in file names.

It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is
not a valid Unicode character (not a character at all, really) and the
only way you can put it in a POSIX filename is if you use a very
lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'.
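
A minimal sketch of what "very lenient" means here, assuming a Python
3.1+ interpreter, where the strict UTF-8 codec rejects lone surrogates
and the 'surrogatepass' error handler stands in for the lenient
encoder. The handler name comes from later Python releases, not from
this message:

```python
# A lone low surrogate: a code point, but not a valid Unicode character.
lone = '\udcff'

# A strict UTF-8 encoder refuses to encode it.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lenient encoder (assumption: modelled here by 'surrogatepass')
# emits the three-byte form anyway.
lenient = lone.encode('utf-8', 'surrogatepass')

print(strict_ok)   # False
print(lenient)     # b'\xed\xb3\xbf'
```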

Since this byte sequence doesn't represent a valid character when
decoded with UTF-8, it should simply be considered an invalid UTF-8
sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
'\udcff').

Martin: maybe the PEP should say this explicitly?

Note that the round trip distinguishes, without ambiguity, between
'\udcff' in the filename (as written by a lenient encoder):

b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'

and b'\xff' in the filename, decoded by Python to '\udcff':

b'\xff' -> '\udcff' -> b'\xff'
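
The two round trips above can be checked with a short sketch, assuming
the error handler as it later shipped in Python 3.1+ under the name
'surrogateescape' (the name is not used in this message). The helper
roundtrip() is hypothetical and exists only for illustration:

```python
def roundtrip(raw: bytes) -> str:
    """Decode filename bytes and verify that encoding restores them."""
    text = raw.decode('utf-8', 'surrogateescape')
    # Each undecodable byte 0xNN became the lone surrogate U+DCNN,
    # so encoding with the same handler must give the bytes back.
    assert text.encode('utf-8', 'surrogateescape') == raw
    return text

# Bytes produced by a lenient encoder for '\udcff': three invalid bytes.
from_lenient = roundtrip(b'\xed\xb3\xbf')

# A single undecodable byte in the filename.
from_raw = roundtrip(b'\xff')

print(ascii(from_lenient))  # '\udced\udcb3\udcbf'
print(ascii(from_raw))      # '\udcff'
```

The two decoded strings differ, so neither filename can be mistaken
for the other.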

-- 
Lino Mastrodomenico