[Python-Dev] Unicode and Windows

M.-A. Lemburg mal@lemburg.com
Wed, 22 Mar 2000 12:04:32 +0100


Mark Hammond wrote:
> 
> >
> > Right. The idea with open() was to write a special version (using
> > #ifdefs) for use on Windows platforms which does all the needed
> > magic to convert Unicode to whatever the native format and locale
> > is...
> 
> That works for open() - but what about other extension modules?
> 
> This seems to imply that any Python extension on Windows that wants to pass
> a Unicode string to an external function can not use PyArg_ParseTuple() with
> anything other than "O", and perform the magic themselves.
> 
> This just seems a little back-to-front to me.  Platforms that have _no_
> native Unicode support have useful utilities for working with Unicode.
> Platforms that _do_ have native Unicode support can not make use of these
> utilities.  Is this by design, or simply a sad side-effect of the design?
> 
> So - it is trivial to use Unicode on platforms that don't support it, but
> quite difficult on platforms that do.

The problem is that Windows seems to use a completely different
internal Unicode format than most of the rest of the world.

As I've commented in a different post, the only way to have
PyArg_ParseTuple() perform auto-conversion is by allowing it
to return objects which are garbage collected by the caller.
The problem with this is error handling, since PyArg_ParseTuple()
will have to keep track of all objects it created until the
call returns successfully. An alternative approach is sketched
below.

Note that *all* platforms will have to use this approach...
not only Windows or other platforms with Unicode support.

> > Using parser markers for this is obviously *not* the right way
> > to get to the core of the problem. Basically, you will have to
> > write a helper which takes a string, Unicode or some other
> > "t" compatible object as name object and then converts it to
> > the system's view of things.
> 
> Why "obviously"?  What on earth does the existing mechanism buy me on
> Windows, other than grief that I can not use it?

Sure, you can :-) Just fetch the object, coerce it to
Unicode and then encode it according to your platform needs
(PyUnicode_FromObject() takes care of the coercion part for you).
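
The fetch / coerce / encode sequence described above can be modelled in
a few lines. This is only a sketch, written in Python rather than C so
that it is self-contained. The helper name to_native_bytes and its
source_encoding parameter are hypothetical. Decoding byte strings as
ASCII and picking "mbcs" on Windows (the file system encoding elsewhere)
are assumptions made for illustration, not something the C API
prescribes.

```python
import sys


def to_native_bytes(obj, source_encoding="ascii"):
    """Coerce obj to text, then encode it for the host platform.

    Hypothetical helper: it mirrors the "fetch the object, coerce it
    to Unicode, encode it" recipe, not any existing C API function.
    """
    # Step 1: coerce to Unicode (assumption: byte strings are ASCII).
    if isinstance(obj, (bytes, bytearray)):
        text = bytes(obj).decode(source_encoding)
    elif isinstance(obj, str):
        text = obj
    else:
        raise TypeError("expected str or bytes-like, got %s"
                        % type(obj).__name__)

    # Step 2: encode according to the platform's needs (assumption:
    # "mbcs" on Windows, the file system encoding everywhere else).
    if sys.platform == "win32":
        native = "mbcs"
    else:
        native = sys.getfilesystemencoding()
    return text.encode(native)
```

An extension would run a step like this once per argument, right after
fetching the object with the "O" marker.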
 
> > I think we had a private discussion about this a few months ago:
> > there was some way to convert Unicode to a platform independent
> > format which then got converted to MBCS -- don't remember the details
> > though.
> 
> There is a Win32 API function for this.  However, as you succinctly pointed
> out, not many people are going to be aware of its name, or how to use the
> multitude of flags offered by these conversion functions, or know how to
> deal with the memory management, etc.
> 
> > Can't you use the wchar_t interfaces for the task (see
> > the unicodeobject.h file for details) ? Perhaps you can
> > first transfer Unicode to wchar_t and then on to MBCS
> > using a win32 API ?!
> 
> Sure - I can.  But can everyone who writes interfaces to Unicode functions?
> You wrote the Python Unicode support but don't know its name - pity the poor
> Joe Average trying to write an extension.

Hey, Mark... I'm not a Windows geek. How can I know which APIs
are available and which of them to use ?

And that's my point: add conversion APIs and codecs for the different
OSes which make the extension writer's life easier.
 
> It seems to me that, on Windows, the Python Unicode support as it stands is
> really internal.  I can not think of a single time that an extension writer
> on Windows would ever want to use the "t" markers - am I missing something?
> I don't believe that a single Unicode-aware function in the Windows
> extensions (of which there are _many_) could be changed to use the "t"
> markers.

"t" is intended to return a text representation of a buffer
interface aware type... this happens to be UTF-8 for Unicode
objects -- what other encoding would you have expected ?

> It still seems to me that the Unicode support works well on platforms with
> no Unicode support, and is fairly useless on platforms with the support.  I
> don't believe that any extension on Windows would want to use the "t"
> marker - so, as Fred suggested, how about providing something for us that
> can help us interface to the platform's Unicode?

That's exactly what I've been saying all along...
there currently are PyUnicode_AsWideChar() and PyUnicode_FromWideChar()
to interface to the compiler's wchar_t type. I have no problem
adding more of these APIs for the various OSes -- but they
would have to be coded by someone with Unicode skills on each
of those platforms, e.g. PyUnicode_AsMBCS() and PyUnicode_FromMBCS()
on Windows.
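
To give a feel for what the wchar_t round trip does, here is a small
stand-in that uses Python's ctypes module. It only models the behaviour:
the functions as_wide_char and from_wide_char are hypothetical names for
this sketch, not the C functions PyUnicode_AsWideChar() and
PyUnicode_FromWideChar(). The width of wchar_t depends on the compiler
and platform, which is part of why per-platform helpers are needed.

```python
import ctypes

# Width of the platform's wchar_t in bytes; this varies by platform.
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)


def as_wide_char(text):
    """Copy text into a newly allocated, NUL-terminated wchar_t array."""
    return ctypes.create_unicode_buffer(text)


def from_wide_char(buf):
    """Build a Python string from a wchar_t array, stopping at the NUL."""
    return buf.value
```

A round trip such as from_wide_char(as_wide_char("abc")) gives back the
original string. An MBCS variant would need one more conversion step on
top of this, supplied by the platform.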
 
> This is getting too hard for me - I will release my windows registry module
> without Unicode support, and hope that in the future someone cares enough to
> address it, and to add a large number of LOC that will be needed simply to
> get Unicode talking to Unicode...

I think you're getting this wrong: I'm not arguing against adding
better support for Windows.

The only way I can think of using parser markers in this context
would be by having PyArg_ParseTuple() *copy* data into a given
data buffer rather than only passing a reference to it. This
would enable PyArg_ParseTuple() to apply whatever conversion
is needed while still keeping the temporary objects internal.

Hmm, sketching a little:

"es#",&encoding,&buffer,&buffer_len
	-- could mean: coerce the object to Unicode, then
	   encode it using the given encoding and then 
	   copy at most buffer_len bytes of data into
	   buffer and update buffer_len to the number of bytes
	   copied

This costs some cycles for copying data, but gets rid of
the problems involved in cleaning up after errors. The
caller will have to ensure that the buffer is large enough
and that the encoding fits the application's needs. Error
handling will be poor since the caller can't take any
action other than to pass on the error generated by
PyArg_ParseTuple().
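
The copy semantics sketched for "es#" can be modelled as follows. This
is a Python illustration of the proposal as worded in this message, not
an implementation inside PyArg_ParseTuple(). The function name
parse_es_hash is hypothetical, the caller's buffer is represented by a
fixed-size bytearray, and decoding byte strings as ASCII is an
assumption.

```python
def parse_es_hash(obj, encoding, buffer):
    """Model of the proposed "es#" marker.

    Coerces obj to Unicode, encodes it with the given encoding, copies
    at most len(buffer) bytes into buffer and returns the number of
    bytes copied (the updated buffer_len).
    """
    # Coerce to Unicode (assumption: byte strings are ASCII).
    if isinstance(obj, (bytes, bytearray)):
        text = bytes(obj).decode("ascii")
    elif isinstance(obj, str):
        text = obj
    else:
        raise TypeError("expected str or bytes-like, got %s"
                        % type(obj).__name__)

    # The encoded data is a temporary that never leaves this function,
    # so the caller has nothing to clean up if an error is raised.
    data = text.encode(encoding)

    # Copy at most buffer_len bytes; the buffer keeps its size.
    copied = min(len(data), len(buffer))
    buffer[:copied] = data[:copied]
    return copied
```

With an 8 byte buffer, a short string encoded as UTF-8 fits completely.
With a 2 byte buffer the copy is cut off and may end in the middle of a
multi-byte sequence, which is why the caller has to make sure the
buffer is large enough.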

Thoughts ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/