[I18n-sig] Passing unicode strings to file system calls

Martin v. Loewis martin@v.loewis.de
17 Jul 2002 20:25:30 +0200


"Bleyer, Michael" <MBleyer@DEFiNiENS.com> writes:

> My question is thus: since modern-day operating systems claim to support
> unicode (I assume) in filenames

That is not really true. WinNT and MacOS do. Unix only supports
byte-based file names, and there is an ongoing debate on how non-ASCII
characters should be represented in those bytes. The convention seems
to be that the locale's encoding should be assumed for file names.

As MAL explains, you can pass Unicode file names automatically in
Python 2.2; you might need to invoke locale.setlocale for this to work
properly.
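
A minimal sketch of that, assuming a Python 2.2 interpreter on a
system whose locale encoding can represent the characters involved
(the file name here is only an illustration):

    import locale

    # Adopt the user's locale settings, and with them the locale's encoding.
    locale.setlocale(locale.LC_ALL, '')

    # The Unicode file name gets encoded for the underlying system call.
    f = open(u'caf\xe9.txt', 'w')
    f.write('hello\n')
    f.close()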

> Alternatively how can I find out the "proper" or "legal" encoding for a
> unicode string just by looking at the string (e.g. not with a brute force
> try-encode-except trial and error loop).

For this, you need to tell us what system you use.

> As a side problem: how do I deal with filename length limits, since these
> are actually byte limits not character limits?

Again, this depends on the system. As a starting point, you need to
find out what the limit actually is.
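
On a POSIX system, os.pathconf reports those limits; a rough sketch,
assuming the current directory lives on the file system you care
about:

    import os

    # NAME_MAX limits a single path component (in bytes);
    # PATH_MAX limits the length of the whole path.
    name_max = os.pathconf('.', 'PC_NAME_MAX')
    path_max = os.pathconf('.', 'PC_PATH_MAX')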

> If I do a u''[:255] followed by an encode I end up with a unicode string
> thats at most 255 characters long, but may be longer than 255 bytes after
> encoding.

Also, the limit might be smaller than 255.

> If I do encode followed by ''[:255] I get at most 255 bytes but my string
> may be illegal because I cut off in the middle of a 3-byte character.

If truncation is acceptable, I recommend truncating to 50% of the
maximum size and asserting that the encoded result is smaller than the
maximum size. You can try to be smarter and use a binary search to
find the longest character string whose encoded form still fits.
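
A sketch of that binary search, assuming the target encoding and the
byte limit are known (the function name is made up for the example;
the search relies on the encoded length only growing as the prefix
grows, which holds for the usual encodings):

    def truncate_to_bytes(uname, encoding, limit):
        """Longest prefix of uname whose encoded form fits in limit bytes."""
        lo, hi = 0, len(uname)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if len(uname[:mid].encode(encoding)) <= limit:
                lo = mid        # this prefix still fits, try a longer one
            else:
                hi = mid - 1    # too long, shrink the candidate
        return uname[:lo]

Something like truncate_to_bytes(name, 'utf-8', 255) then never cuts
off in the middle of a multi-byte character, because it only drops
whole characters and measures the result in bytes.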

Regards,
Martin