[Python-Dev] Unicode and Windows

Tue, 21 Mar 2000 01:25:09 +0100

Mark Hammond wrote:
> 
> I would like to discuss Unicode on the Windows platform, and how it relates
> to MBCS that Windows uses.
> 
> My main goal here is to ensure that Unicode on Windows can make a round-trip
> to and from native Unicode stores.  As an example, let's take the registry -
> a Windows user should be able to read a Unicode value from the registry then
> write it back.  The value written back should be _identical_ to the value
> read.  Ditto for the file system: If the filesystem is Unicode, then I would
> expect the following code:
>   import os
>   for fname in os.listdir("."):
>     f = open(fname + ".tmp", "w")
>     f.close()
> 
> to create files on the filesystem with the exact base name, even when the
> base name contains non-ASCII characters.
> 
> However, the Unicode patches do not appear to make this possible.  open()
> uses PyArg_ParseTuple(args, "s...");  PyArg_ParseTuple() will automatically
> convert a Unicode object to UTF-8, so we end up passing a UTF-8 encoded
> string to the C runtime fopen function.

Right. The idea with open() was to write a special version (using
#ifdefs) for use on Windows platforms which does all the needed
magic to convert Unicode to whatever the native format and locale
is...

Using parser markers for this is obviously *not* the right way
to get to the core of the problem. Basically, you will have to
write a helper which takes a string, a Unicode object or some other
"t"-compatible object as the name and then converts it to
the system's view of things.
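
A minimal sketch of such a helper, written in present-day Python 3 rather
than in the C of the patches discussed here. The name to_native_name and its
encoding parameter are hypothetical, and treating the interpreter's file
system encoding as the "system's view" is an assumption; the standard
library's os.fsencode() covers similar ground.

```python
import sys


def to_native_name(name, encoding=None):
    """Return `name` as bytes in the encoding the file system expects.

    Hypothetical helper, for illustration only.
    """
    if encoding is None:
        # Assumption: the interpreter's file system encoding stands in
        # for "the system's view of things".
        encoding = sys.getfilesystemencoding()
    if isinstance(name, (bytes, bytearray)):
        # Already an 8-bit string: pass it through unchanged.
        return bytes(name)
    if isinstance(name, str):
        # Unicode: encode explicitly and fail loudly on characters the
        # target encoding cannot represent.
        return name.encode(encoding, "strict")
    raise TypeError("file name must be str or bytes, not %s" % type(name).__name__)


print(to_native_name("caf\u00e9.tmp", encoding="cp1252"))  # b'caf\xe9.tmp'
```

Passing the encoding explicitly, as in the last line, keeps the conversion
visible at the call site instead of hiding it in an argument parser.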

I think we had a private discussion about this a few months ago:
there was some way to convert Unicode to a platform independent
format which then got converted to MBCS -- don't remember the details
though.

> The end result of all this is that we end up with UTF-8 encoded names in the
> registry/on the file system.  It does not seem possible to get a true
> Unicode string onto either the file system or in the registry.
> 
> Unfortunately, I'm not experienced enough to know the full ramifications, but
> it _appears_ that on Windows the default "unicode to string" translation
> should be done via the WideCharToMultiByte() API.  This will then pass an
> MBCS encoded ascii string to Windows, and the "right thing" should magically
> happen.  Unfortunately, MBCS encoding is dependent on the current locale
> (i.e., one MBCS sequence will mean completely different things depending on
> the locale).  I don't see a portability issue here, as the documentation
> could state that "Unicode->ASCII conversions use the most appropriate
> conversion for the platform.  If the platform is not Unicode aware, then
> UTF-8 will be used."

No, no, no... :-) The default should be (and is) UTF-8 on all platforms
-- whether the platform supports Unicode or not. If a platform
uses a different encoding, an encoder should be used which applies
the needed transformation.
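
The difference between the UTF-8 default and an explicit encoder can be
sketched in present-day Python. cp1252 is used here only as a stand-in for
one possible Windows code page; the real code page depends on the locale.

```python
name = "caf\u00e9"  # one non-ASCII character: e with acute accent

# Default conversion: always UTF-8, independent of the platform.
utf8_bytes = name.encode("utf-8")

# Explicit encoder for a platform that expects a different encoding.
native_bytes = name.encode("cp1252")

# What a cp1252 consumer sees when it is handed the UTF-8 bytes as-is:
# the two UTF-8 bytes of the accented letter turn into two characters.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)                  # b'caf\xc3\xa9'
print(native_bytes)                # b'caf\xe9'
print(len(name), len(misread))     # 4 5
```

The explicit encoder round-trips through the code page, while the
unconverted UTF-8 bytes come back as a different, longer name.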

> This issue is the final one before I release the win32reg module.  It seems
> _critical_ to me that if Python supports Unicode and the platform supports
> Unicode, then Python unicode values must be capable of being passed to the
> platform.  For the win32reg module I could quite possibly hack around the
> problem, but the more general problem (illustrated by the open() example
> above) still remains...
> 
> Any thoughts?

Can't you use the wchar_t interfaces for the task (see
the unicodeobject.h file for details)? Perhaps you can
first convert the Unicode object to wchar_t and then go on to MBCS
using a Win32 API?
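
The two-step route can be sketched in present-day Python with ctypes. This
is an illustration under stated assumptions, not the C code the patches
would need: the function name unicode_to_native_bytes and its fallback
parameter are hypothetical, and on non-Windows systems cp1252 merely stands
in for a code page. On Windows, Python's "mbcs" codec performs the
conversion through WideCharToMultiByte().

```python
import ctypes
import sys


def unicode_to_native_bytes(text, fallback="cp1252"):
    """Convert `text` to a multi-byte string via a wchar_t buffer.

    Hypothetical helper, for illustration only.
    """
    # Step 1: copy the Unicode data into a NUL-terminated wchar_t array.
    wide = ctypes.create_unicode_buffer(text)

    # Step 2: convert the wide characters to the platform's multi-byte
    # encoding. "mbcs" exists only on Windows, hence the fallback.
    codec = "mbcs" if sys.platform == "win32" else fallback
    return wide.value.encode(codec)


print(unicode_to_native_bytes("readme.tmp"))  # b'readme.tmp'
```

For non-ASCII input the result depends on the active code page, which is
the locale dependence raised earlier in the thread.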

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/