[Python-Dev] Unicode and Windows
M.-A. Lemburg
mal@lemburg.com
Tue, 21 Mar 2000 01:25:09 +0100
Mark Hammond wrote:
>
> I would like to discuss Unicode on the Windows platform, and how it relates
> to MBCS that Windows uses.
>
> My main goal here is to ensure that Unicode on Windows can make a round-trip
> to and from native Unicode stores. As an example, let's take the registry -
> a Windows user should be able to read a Unicode value from the registry then
> write it back. The value written back should be _identical_ to the value
> read. Ditto for the file system: If the filesystem is Unicode, then I would
> expect the following code:
>     for fname in os.listdir():
>         f = open(fname + ".tmp", "w")
>
> to create filenames on the filesystem with the exact base name even when the
> base name contains non-ASCII characters.
>
> However, the Unicode patches do not appear to make this possible. open()
> uses PyArg_ParseTuple(args, "s..."); PyArg_ParseTuple() will automatically
> convert a Unicode object to UTF-8, so we end up passing a UTF-8 encoded
> string to the C runtime fopen function.
Right. The idea with open() was to write a special version (using
#ifdefs) for use on Windows platforms which does all the needed
magic to convert Unicode to whatever the native format and locale
is...
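
To make the quoted problem concrete, here is a small sketch in present-day Python (not the API under discussion in this thread) of what happens when UTF-8 bytes are handed to an interface that reads them in a Windows code page. The sample name and the choice of cp1252 are assumptions made purely for illustration.

```python
# Illustration only: the sample name and the cp1252 code page are assumptions.
name = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# What a UTF-8 based "s"-style conversion would hand to the C runtime.
utf8_bytes = name.encode("utf-8")

# How a consumer that assumes cp1252 would interpret those same bytes.
seen_by_ansi_api = utf8_bytes.decode("cp1252")

# The name no longer matches the original, so it does not round-trip.
round_trips = (seen_by_ansi_api == name)
print(repr(name), "->", repr(seen_by_ansi_api), "round trips:", round_trips)
```

The accented character becomes two unrelated characters, which is the kind of mangled name the quoted text describes.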
Using parser markers for this is obviously *not* the right way
to get to the core of the problem. Basically, you will have to
write a helper which takes a string, Unicode or some other
"t" compatible object as name object and then converts it to
the system's view of things.
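
A minimal sketch of such a helper, written in present-day Python rather than the C API discussed here. The name `to_system_name` and its `encoding` parameter are hypothetical, and using `sys.getfilesystemencoding()` as the default is an assumption about what "the system's view of things" should mean.

```python
import sys


def to_system_name(name, encoding=None):
    """Hypothetical helper: return `name` as bytes in the system's encoding.

    Text is encoded explicitly; byte-like objects are assumed to be in the
    system's encoding already and are passed through unchanged.
    """
    if encoding is None:
        # Assumption: the filesystem encoding is the "system's view".
        encoding = sys.getfilesystemencoding()
    if isinstance(name, str):
        return name.encode(encoding)
    if isinstance(name, (bytes, bytearray, memoryview)):
        return bytes(name)
    raise TypeError("name must be text or a bytes-like object, not %s"
                    % type(name).__name__)


# Usage: the same call accepts text and bytes.
print(to_system_name("report.tmp", "ascii"))
print(to_system_name(b"report.tmp"))
```

Keeping the conversion in one helper means the argument parser does not need to know anything about the platform.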
I think we had a private discussion about this a few months ago:
there was some way to convert Unicode to a platform independent
format which then got converted to MBCS -- don't remember the details
though.
> The end result of all this is that we end up with UTF-8 encoded names in the
> registry/on the file system. It does not seem possible to get a true
> Unicode string onto either the file system or in the registry.
>
> Unfortunately, I'm not experienced enough to know the full ramifications, but
> it _appears_ that on Windows the default "unicode to string" translation
> should be done via the WideCharToMultiByte() API. This will then pass an
> MBCS encoded ascii string to Windows, and the "right thing" should magically
> happen. Unfortunately, MBCS encoding is dependent on the current locale
> (ie, one MBCS sequence will mean completely different things depending on
> the locale). I don't see a portability issue here, as the documentation
> could state that "Unicode->ASCII conversions use the most appropriate
> conversion for the platform. If the platform is not Unicode aware, then
> UTF-8 will be used."
No, no, no... :-) The default should be (and is) UTF-8 on all platforms
-- whether the platform supports Unicode or not. If a platform
uses a different encoding, an encoder should be used which applies
the needed transformation.
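
The difference between a fixed default and an explicit encoder can be sketched in present-day Python as follows. The sample character and the two code pages (cp932 and cp1252) are assumptions picked to show the locale dependence mentioned in the quoted text; they are not taken from the patches.

```python
# Illustration only: the character and both code pages are assumptions.
text = "\u30a2"  # KATAKANA LETTER A

# The default encoding gives the same bytes on every platform.
default_bytes = text.encode("utf-8")

# A platform-specific encoding is requested explicitly through an encoder.
japanese_bytes = text.encode("cp932")

# The same multi-byte sequence read under a different locale's code page
# yields different characters, so a locale-dependent default is ambiguous.
as_western = japanese_bytes.decode("cp1252")

print(default_bytes, japanese_bytes, repr(as_western))
```

UTF-8 stays unambiguous wherever the bytes end up, while the code page bytes only mean something together with the locale that produced them.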
> This issue is the final one before I release the win32reg module. It seems
> _critical_ to me that if Python supports Unicode and the platform supports
> Unicode, then Python unicode values must be capable of being passed to the
> platform. For the win32reg module I could quite possibly hack around the
> problem, but the more general problem (categorized by the open() example
> above) still remains...
>
> Any thoughts?
Can't you use the wchar_t interfaces for the task (see
the unicodeobject.h file for details) ? Perhaps you can
first transfer Unicode to wchar_t and then on to MBCS
using a win32 API ?!
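
The two-step route can be sketched in present-day Python. This emulates both steps with Python's own codecs instead of calling the wchar_t interfaces or a Win32 API, so it only shows the data flow. The function names are hypothetical, 16-bit wchar_t units (as on Windows) are assumed, and cp1252 stands in for whatever code page the locale would select.

```python
import struct


def unicode_to_wchar_units(text):
    """Hypothetical step 1: text -> sequence of 16-bit wchar_t code units."""
    raw = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)


def wchar_units_to_codepage(units, codepage):
    """Hypothetical step 2: 16-bit code units -> bytes in a given code page.

    On Windows this is the job WideCharToMultiByte() performs; here it is
    emulated with Python codecs.
    """
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le").encode(codepage)


units = unicode_to_wchar_units("caf\u00e9")
mbcs_bytes = wchar_units_to_codepage(units, "cp1252")  # code page is an assumption
print(units, mbcs_bytes)
```

Characters that the target code page cannot represent raise `UnicodeEncodeError` in this sketch; a real implementation would have to decide how to handle them.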
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/