[Python-Dev] Unicode support in getargs.c

Sun, 06 Jan 2002 17:58:41 +0100

Jack Jansen wrote:

> I'm going to jump out of this discussion for a while. Martin and Mark have 
> a completely different view on Unicode than I do, apparently, and I think 
> I should first try and see if I can use the current implementation.

> For the record: my view of Unicode is really "ascii done right", i.e. a 
> datatype that allows you to get richer characters than what 1960s ascii 
> gives you. For this it should be as backward-compatible as possible, i.e. 
> if some API expects a unicode filename and I pass "a.out" it should 
> interpret it as u"a.out". All the converting to different charsets is 
> icing on the cake, the number one priority should be that unicode is as 
> compatible as possible with the 8-bit convention used on the platform 
> (whatever it may be). No, make that the number 2 priority: the number one 
> priority is compatibility with 7-bit ascii. Using Python StringObjects as 
> binary buffers is also far less common than using StringObjects to store 
> plain old strings, so if either of these uses bites the other it's the 
> binary buffer that needs to suffer. UnicodeObjects and StringObjects 
> should behave pretty orthogonal to how FloatObjects and IntObjects behave.

It would be nice if Unicode could be made to behave that way,
but unfortunately, the 8-bit world is so differentiated with
lots of different encodings that not even Harry Potter would
have much luck finding the right magic to apply.
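
To illustrate the point, here is a minimal sketch of how one and the same
byte decodes to a different character under each 8-bit encoding, while
plain 7-bit ASCII data is unaffected. It assumes a modern Python 3
interpreter (where bytes and str are separate types) rather than the
StringObject/UnicodeObject pair discussed in this thread; the helper name
decode_all and the choice of codecs are mine, picked purely for
illustration.

```python
# One byte value, interpreted under several common 8-bit encodings.
RAW = b"\xe4"
CODECS = ["latin-1", "cp1251", "cp437", "koi8-r"]


def decode_all(data, codecs=CODECS):
    """Return a dict mapping each codec name to the decoded text."""
    return {name: data.decode(name) for name in codecs}


if __name__ == "__main__":
    # The same byte yields a different code point under each codec.
    for name, text in decode_all(RAW).items():
        print(f"{name:8} -> U+{ord(text):04X}")

    # Pure 7-bit ASCII bytes decode identically under all of these codecs,
    # which is why "a.out" is the easy case and everything else is not.
    print(set(decode_all(b"a.out").values()))
```

Without knowing which encoding the 8-bit data is in, there is no single
correct Unicode value for such a byte, so an automatic conversion would
have to guess.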

Another problem is the getargs.c API itself: since it returns
pointers to data buffers, auto-conversions (if possible at all)
that involve temporary objects must be handled differently from
normal Python string objects.

Now, the question is whether you are willing to pay for the
comfort of getting direct access to a Py_UNICODE buffer (or char
buffer) with an extra copy and additional PyMem_Free() cleanup
overhead or not. The "O" parser marker doesn't provide any
magic on its own, but it does reduce the need for copying data
and handling memory management in your APIs.

In my last message on this thread, I proposed to add "eu#" which
returns a Py_UNICODE buffer, possibly decoding a string object
using the given encoding first. As Martin noted, this option
requires extra copying but simplifies the C coding somewhat.
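
The conversion rule behind the proposed "eu#" marker could be modelled
roughly as follows. This is only a Python-level sketch of the semantics
described above, written for a modern Python 3 interpreter; the function
name to_unicode is hypothetical, and the real marker would live in
getargs.c and hand back a freshly allocated Py_UNICODE buffer that the
caller has to free.

```python
def to_unicode(obj, encoding):
    """Model of the proposed conversion: return (text, length).

    Unicode input is passed through (the C version would still copy it
    into a new buffer); 8-bit string input is first decoded using the
    encoding supplied by the caller.
    """
    if isinstance(obj, str):
        return obj, len(obj)
    if isinstance(obj, (bytes, bytearray)):
        text = bytes(obj).decode(encoding)
        return text, len(text)
    raise TypeError("expected a string or Unicode object")


if __name__ == "__main__":
    print(to_unicode("a.out", "latin-1"))
    print(to_unicode(b"a.out", "latin-1"))
```

The encoding is always supplied explicitly by the C code calling the
parser, so nothing has to be guessed from the 8-bit data itself.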

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/