[I18n-sig] Passing unicode strings to file system calls

M.-A. Lemburg mal@lemburg.com
Wed, 17 Jul 2002 18:50:59 +0200


Bleyer, Michael wrote:
> Assume I have a list of unicode strings in UTF-16-le. Reading and parsing
> the list all works really fine.
> 
> Now I want to create/copy a number of files and I want the file/directory
> names to be these unicode strings.
> When I give a unicode string to a file system call like
> shutil.copy()
> or 
> os.mkdir()
> Python converts the unicode string to a "regular" string using the default
> site encoding (which usually fails if that encoding is 'ascii').
> I can influence this by encode()'ing myself before I pass the string to the
> system function call, so far so good.
> 
> However, I do have a problem if I have unicode strings from different,
> non-compatible encodings in my list (e.g. ISO Latin-1 and some Asian
> encoding), as I cannot use the same encoding conversion for all strings,
> some will fail. I can of course convert to UTF-8 which will always work, but
> the filenames turn out to be garbage (because the OS does not interpret them
> as UTF-8 but in the local encoding).
> 
> My question is thus: since modern-day operating systems claim to support
> unicode (I assume) in filenames, how do I pass a unicode string directly to
> a system function call without having to convert to a "localized" encoding?

Python 2.2 tries to automagically encode Unicode file names into
the encoding used by the OS. This only works if Python can figure
out that encoding. AFAIK, this is only supported on Windows
platforms at the moment.
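
For readers on later Python versions, here is a minimal sketch of how to
ask the interpreter which encoding it uses for file names. Note the
assumptions: sys.getfilesystemencoding() was added after the Python 2.2
release discussed here, the code uses Python 3 syntax, and the helper
name describe_fs_encoding is made up for illustration.

```python
import codecs
import sys


def describe_fs_encoding():
    """Return the normalized codec name Python uses for file names."""
    # Older Python 2 builds could return None here, so fall back to the
    # interpreter's default encoding in that case.
    name = sys.getfilesystemencoding() or sys.getdefaultencoding()
    # codecs.lookup() normalizes aliases and raises LookupError if the
    # codec is unknown to this interpreter.
    return codecs.lookup(name).name


print(describe_fs_encoding())
```

The printed value depends on the platform and locale, so it is worth
checking on each target machine rather than assuming a result.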

> Alternatively how can I find out the "proper" or "legal" encoding for a
> unicode string just by looking at the string (i.e. not with a brute force
> try-encode-except trial and error loop).

If you know the encoding used by the file system, then you should
simply encode the Unicode filename using that encoding.
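
A small sketch of that advice, in Python 3 syntax. The function name
encode_filename and the choice to return None for unrepresentable names
are assumptions made for illustration, not part of any library; the
caller is expected to supply the file system's encoding.

```python
def encode_filename(name, fs_encoding):
    """Encode a Unicode file name for a file system with a known encoding.

    Returns the encoded bytes, or None if the name contains characters
    that fs_encoding cannot represent.
    """
    try:
        # Strict error handling: fail rather than silently mangle the name.
        return name.encode(fs_encoding)
    except UnicodeEncodeError:
        return None


# A Latin-1 file system can hold an accented name but not a Japanese one.
print(encode_filename("caf\u00e9", "latin-1"))
print(encode_filename("\u65e5\u672c", "latin-1"))
```

Returning None keeps the check to a single encode call per name, so the
caller can decide whether to skip, rename, or report the file.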

> As a side problem: how do I deal with filename length limits, since these
> are actually byte limits not character limits?
> If I do a u''[:255] followed by an encode I end up with a unicode string
> that's at most 255 characters long, but may be longer than 255 bytes after
> encoding.
> If I do encode followed by ''[:255] I get at most 255 bytes but my string
> may be illegal because I cut off in the middle of a 3-byte character.

Good question. You could try truncating after the encoding step
and then have Python decode the truncated bytes using the 'ignore'
error handling, which drops a partial character left at the cut.
That should give you the longest Unicode string that still fits
into the byte limit once it is encoded again.
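
The encode, truncate, decode-with-'ignore' idea could look roughly like
this in Python 3 syntax. The function name truncate_to_bytes and the
default limit of 255 bytes are assumptions taken from the question, not
fixed properties of any file system.

```python
def truncate_to_bytes(name, encoding="utf-8", max_bytes=255):
    """Return the longest prefix of name that encodes to <= max_bytes."""
    encoded = name.encode(encoding)
    if len(encoded) <= max_bytes:
        return name
    # Cutting the byte string may split a multi-byte character. Since the
    # bytes came from a valid encode, only the tail can be damaged, and
    # the 'ignore' handler drops that incomplete sequence.
    return encoded[:max_bytes].decode(encoding, "ignore")


long_name = "\u00e9" * 200           # 200 characters, 400 bytes in UTF-8
short_name = truncate_to_bytes(long_name)
print(len(short_name), len(short_name.encode("utf-8")))
```

The returned string is what should be encoded and passed on; because
'ignore' discards the partial character, the result can be a byte or two
shorter than the limit.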

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/