[I18n-sig] Passing unicode strings to file system calls

Bleyer, Michael MBleyer@DEFiNiENS.com
Thu, 18 Jul 2002 15:42:06 +0200


> > Python 2.2 tries to automagically encode Unicode into the encoding 
> > used by the OS. This only works if Python can figure out this 
> > encoding. AFAIK, only Windows platforms are supported.
> 
> No; it works on Unix as well (if nl_langinfo(CODESET) is 
> supported); you need to invoke setlocale to activate this 
> support (in particular, the LC_CTYPE category).

It does work on Unix as well, with some caveats. However, I think maybe my
original question was not clear enough.

Let's assume, for the sake of argument, that a call to
locale.getdefaultlocale()[1]
will get me the system's default encoding, which I can use to encode my
Unicode strings so they show up properly when used in filenames.
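In code, I mean something like this (just a sketch; the setlocale() call
and the example filename are assumptions on my part, and
getdefaultlocale() may return (None, None) on some systems):

    import locale

    locale.setlocale(locale.LC_ALL, '')        # activate the user's locale
    encoding = locale.getdefaultlocale()[1]    # e.g. 'utf-8' or 'cp1252'

    name = u'gr\u00fcn.txt'                    # hypothetical Unicode filename
    f = open(name.encode(encoding), 'w')       # encode before hitting the OS
    f.close()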

But I know that in some areas people work with two different, incompatible
(non-symmetric) encodings, for example people in Japan with mixed Sun and
Windows networks. They have some filenames in one encoding and some in the
other. Half of the filenames always show up as garbage, since they cannot
be displayed in the other encoding, and vice versa. Let's assume that
people know this and accept it.

What I want to do is create file names from a list that has strings in both
encodings. The strings can be handled fine while in Unicode, but as soon as
I try to convert all of them to one encoding, half of the conversions will
fail. I just want to convert each one with the proper encoding and then pass
the byte string to the system function. I don't care whether it will
_display_ right, just whether the name is correct.
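For instance (a sketch only; Shift-JIS and EUC-JP just stand in for the two
encodings in question, and in Python 2.2 they need the separate Japanese
codecs package to be available):

    # Encode each Unicode name with whichever candidate encoding
    # accepts it, then hand the resulting byte string to the OS.
    candidates = ['shift-jis', 'euc-jp']   # placeholder encodings

    def to_filename(uname):
        for enc in candidates:
            try:
                return uname.encode(enc)   # first encoding that fits wins
            except UnicodeError:
                pass
        raise ValueError('no candidate encoding can represent %r' % uname)

    names = [u'\u65e5\u672c\u8a9e', u'\uff8a\uff9f\uff9d']  # made-up examples
    for uname in names:
        open(to_filename(uname), 'w').close()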

What I would like to have is some function that will tell me, for a given
Unicode string, the list of all encodings that the string can be converted
into (without having to try all available encodings in a brute-force loop),
because I do not know the proper encoding a priori.
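The only thing I can see right now is exactly that brute-force loop,
roughly (a sketch; the candidate list is whatever encodings one cares
about):

    def possible_encodings(ustr, candidates):
        # Brute force: return every encoding from candidates that can
        # represent the whole Unicode string without error.
        ok = []
        for enc in candidates:
            try:
                ustr.encode(enc)
            except (UnicodeError, LookupError):
                continue
            ok.append(enc)
        return ok

    print possible_encodings(u'\u00e9', ['ascii', 'latin-1', 'utf-8'])
    # -> ['latin-1', 'utf-8']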

The system locale info will only tell me which encoding is _displayed_
properly; it does not mean that this encoding will be able to handle all my
Unicode strings.

I am not sure whether this is a fundamental problem with Unicode: it seems
to be a great way to store data, but as soon as you actually want to do
anything with it you need some extra META information that is not stored in
the data itself (uhm, I don't mean to rant, I realize this holds true for
other formats as well). I also know that an obvious answer to this whole
issue could be "just keep your data in your local encoding and avoid using
Unicode"; unfortunately, the source format is UTF-16.

Anyway, if there isn't a direct interface/solution, what would you consider
the best workaround for Python?
:-)

Mike