unicode filenames

Alex Martelli aleax at aleax.it
Mon Feb 3 05:39:34 EST 2003


Neil Hodgson wrote:

> Alex Martelli:
> 
>> Similar considerations apply for any other multibyte encoding
>> (such as, UTF-8) that is NOT specifically and carefully
>> designed to avoid ever needing a byte of value 47 (0x2F) in
>> order to represent ANY character except a slash.  I am not
>> aware of any such multi-byte encoding -- there may be some,
>> but, even if one can be found, using it would still fall WELL
>> short of "any other encoding whatsoever" as you claimed.
> 
>    UTF-8 is a superset of ASCII. A slash has the same representation in
> UTF-8 as ASCII. No multi-byte UTF-8 character may contain a byte < 128.

Ah!  Wonderful, thanks -- and clearly this was one crucial
point I was missing: UTF-8 *IS* "specifically and carefully 
designed to avoid ever needing a byte of value 47 (0x2F) in 
order to represent ANY character except a slash" (among
other things;-), and therefore _IS_ usable as the encoding
of Unicode names on a non-Unicode-aware Unix system.
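
A minimal sketch of that property, assuming Python 3 (where str is Unicode and encode() returns bytes). The helper name stray_slash_bytes and the sample characters are made up for illustration. It checks whether any character other than a real slash produces a byte of value 0x2F when encoded, and uses UTF-16 as a contrasting encoding that does not share UTF-8's guarantee.

```python
SLASH = 0x2F  # byte value of '/' in ASCII and in UTF-8


def stray_slash_bytes(text, encoding):
    """Return the characters of `text`, other than '/', whose encoded
    form contains a byte equal to 0x2F."""
    return [ch for ch in text
            if ch != "/" and SLASH in ch.encode(encoding)]


# Hypothetical sample: U+012F, U+2F00, U+00E9, U+65E5 and a real slash.
sample = "\u012f\u2f00\u00e9\u65e5/"

# In UTF-8 every byte of a multi-byte sequence is >= 0x80, so no
# non-slash character can yield 0x2F.
utf8_hits = stray_slash_bytes(sample, "utf-8")

# In little-endian UTF-16, U+012F encodes as 2F 01 and U+2F00 as 00 2F,
# so both contain a byte that a byte-oriented system reads as a slash.
utf16_hits = stray_slash_bytes(sample, "utf-16-le")

print(utf8_hits)
print(utf16_hits)
```

With this sample the UTF-8 list comes out empty, while the UTF-16 list holds the two characters whose code points include 0x2F as one of their bytes.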

I think it's still true that this doesn't hold for multi-byte
encodings in general, so it's misleading to claim (as Erik did)
that applications can decide on any such encoding.  But, given
the _ability_ to use UTF-8, all this means is that each
application _should_ use UTF-8 rather than another encoding in
this context (if it needs to be able to represent all Unicode
characters rather than, say, just the subset of them that's in
Latin-1).
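
As a rough illustration of what choosing UTF-8 buys an application, here is a sketch, again assuming Python 3, with made-up path components. It joins Unicode names into a path, encodes the path as UTF-8, splits the raw bytes on the slash byte the way a byte-oriented system would, and decodes each piece. It also shows that Latin-1 cannot represent the same names at all.

```python
# Hypothetical path components, some outside Latin-1.
components = ["r\u00e9sum\u00e9s", "\u65e5\u672c\u8a9e", "notes.txt"]
unicode_path = "/".join(components)

# Encode the whole path as UTF-8, as an application might before
# handing it to a system that only sees bytes.
raw_path = unicode_path.encode("utf-8")

# Splitting on the slash byte recovers exactly the original components,
# because no other character contributes a 0x2F byte in UTF-8.
recovered = [part.decode("utf-8") for part in raw_path.split(b"/")]

# Latin-1 covers only U+0000..U+00FF, so the second component
# cannot be encoded in it.
try:
    unicode_path.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(recovered == components)
print(latin1_ok)
```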


Thanks!


Alex






More information about the Python-list mailing list