[Python-Dev] Filename as byte string in python 2.6 or 3.0?

Adam Olsen rhamph at gmail.com
Tue Sep 30 01:31:45 CEST 2008


On Mon, Sep 29, 2008 at 4:49 PM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> Originally I thought that this was a valid idea, but then it became
>> clear that this could be a problem.  Consider a filename which includes
>> a UTF-8 encoding of a PUA code point.
>
> I still think it's a valid idea. For non-UTF-8 file system encodings,
> use PUA characters, and generate them through an error handler.
>
> If the file system encoding is UTF-8, use UTF-8b instead as the
> file system encoding.
>
>> Viewing the PUA with GNOME charmap, I can see that many code points
>> there have character renderings on my Ubuntu system.  I have to assume,
>> therefore, that there are other (and potentially conflicting) uses for
>> this unicode feature.
>
> Depends on how you use it. If you use the PUA block 1 (i.e.
> U+E000..U+F8FF), there is a realistic chance of collision.
>
> If you use the Plane 15 or Plane 16 PUA blocks, there is currently
> zero chance of collision (AFAIK). PUA has a wide use for additional
> characters in TrueType, but I don't think many tools even support
> plane 15 and 16 for generating fonts, or rendering them (it may even
> be that the TrueType/OpenType format doesn't support them in the first
> place). However, Python can make use of these planes fairly easily,
> even in 2-byte mode (through UTF-16).

An example where lossy conversion fails:

1) Create a file using a UTF-8 app with a PUA (or ambiguous scalar of
choice) filename.
2) List the directory in Python.  The file name is now a unicode object
containing the PUA character.
3) Attempt to open it.  The file name gets converted to a malformed
UTF-8 sequence, which doesn't match the name on disk, so opening fails.
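
The three steps above can be simulated in memory, without touching a
real file system.  This is only a sketch: the escape scheme (undecodable
byte 0xNN becomes U+E0NN, and U+E080..U+E0FF becomes a raw byte again on
encoding), the handler name "pua_escape_demo" and the helpers
decode_name/encode_name are all made up for illustration, not anything
Python actually does.  PUA block 1 is used only to keep the numbers
short; the same collision exists for any range of valid scalars.

```python
import codecs

# Hypothetical escape scheme: an undecodable byte 0xNN is represented
# as the PUA character U+E0NN.
PUA_BASE = 0xE000


def _pua_escape(exc):
    """Decode error handler: map each offending byte to a PUA character."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape_demo", _pua_escape)


def decode_name(raw):
    """bytes from the file system -> text, escaping undecodable bytes."""
    return raw.decode("utf-8", "pua_escape_demo")


def encode_name(name):
    """text -> bytes, turning escape characters back into raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)   # treated as an escaped raw byte
        else:
            out += ch.encode("utf-8")   # ordinary character
    return bytes(out)


# A genuinely malformed name round-trips, as the scheme intends.
latin1_name = b"caf\xe9"
assert encode_name(decode_name(latin1_name)) == latin1_name

# 1) Another app created a file whose name is a real, valid PUA character.
on_disk = "\ue0ff".encode("utf-8")      # b'\xee\x83\xbf'
# 2) Listing the directory decodes it to that same PUA character.
listed = decode_name(on_disk)
# 3) Opening re-encodes it, but now it is mistaken for an escaped byte.
reopened = encode_name(listed)          # b'\xff', malformed UTF-8
assert reopened != on_disk              # no such file on disk
```

The malformed name survives the round trip, but only because the valid
name no longer does.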

Lossy conversion just moves around what gets treated as garbage.  Since
every valid Unicode scalar already round-trips to its own byte sequence,
there is no valid scalar left over to stand for an undecodable byte, so
there's no way to create a valid Unicode file name without being lossy.
The alternative is for the name not to be valid Unicode, but since we
can't use such objects with external libs, and can't even print them, we
might as well call them something else.  We already have a name for
that: bytes.


-- 
Adam Olsen, aka Rhamphoryncus

