[Python-Dev] Unicode support in getargs.c

M.-A. Lemburg mal@lemburg.com
Wed, 02 Jan 2002 00:42:21 +0100


Jack Jansen wrote:

> I posted a question on Unicode support in getargs.c last month (working
> on a different project), but now that I'm trying to support
> unicode-based APIs more seriously I find that it leaves even more to be
> desired. I'd like to help to fix this, but I need some direction on
> how things should be fixed.
> 
> Here are some of the issues I ran into today:
> - Unicode objects have a companion string object, meaning that you can
>   pass a unicode object to an "s" format and have the right thing happen.
>   String objects have no such accompanying unicode object, and I think they
>   should have. Right now you cannot pass a string object when the C
>   routine expects a unicode object.


You can: parse the object and then pass it to
PyUnicode_FromObject().


> - There is no unicode equivalent of "c", the single character.
> - "u#" does something useful, but something completely different from
>   what "s#" does. More to the point, it probably does something
>   dangerous, if I understand correctly. If I write a C routine with an
>   "u#" format and the Python code passes a string object the string object
>   will be used as a buffer object and its binary contents will be interpreted
>   as unicode. If the argument in question is a filename this will produce
>   very surprising results:-)


True; "u#" does exactly the same as "s#" -- it interprets the
input as a binary buffer.
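
A minimal sketch of why the buffer interpretation is surprising for text
arguments such as filenames. It is standalone C and does not use the Python
API: the two helper functions are hypothetical, and a 16-bit code unit
(uint16_t) is only an assumption standing in for the interpreter's Unicode
character type. The first helper treats the bytes as if they already were
Unicode data, which is the effect described above; the second decodes them
first, assuming one byte per character (ASCII/Latin-1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a byte buffer out as 16-bit code units without
 * decoding it, i.e. treat the raw bytes as if they were Unicode data.
 * Returns the number of code units written to out. */
size_t bytes_as_code_units(const char *buf, size_t nbytes,
                           uint16_t *out, size_t out_cap)
{
    assert(buf != NULL && out != NULL);
    size_t n = nbytes / sizeof(uint16_t);   /* two bytes fuse into one unit */
    if (n > out_cap) {
        n = out_cap;
    }
    memcpy(out, buf, n * sizeof(uint16_t)); /* host byte order, no decoding */
    return n;
}

/* Hypothetical helper: decode first, widening each byte to one code unit.
 * Assumes the bytes are ASCII or Latin-1 text. */
size_t latin1_to_code_units(const char *buf, size_t nbytes,
                            uint16_t *out, size_t out_cap)
{
    assert(buf != NULL && out != NULL);
    size_t n = nbytes < out_cap ? nbytes : out_cap;
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint16_t)(unsigned char)buf[i];
    }
    return n;
}
```

With the 4-byte string "file", the first helper yields 2 code units whose
values depend on the host byte order and match none of the original
characters; the second yields the 4 expected code units 'f', 'i', 'l', 'e'.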


> To sum things up, I'd like unicode objects to get a little more first class
> citizenship, especially in the light of operating systems that are primarily
> (or exclusively) unicode based, such as Mac OS X or Windows CE.


You would be far better off using the Unicode API on the
objects which are passed into the function rather than relying on
the getargs parser to try to apply some magic to the input
objects.

It might be worthwhile extending the parser markers a bit
more, e.g. introducing "us#" to return Unicode objects
much like "es#" returns strings... I think we'd need some examples
of use though before deciding what's the right way to do this
("es#" was implemented after a request by Mark Hammond to
be able to handle Unicode file names for Win CE).

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/