[Python-Dev] Re: [Python-checkins] python/dist/src/Objects unicodeobject.c, 2.197, 2.198
Tim Peters
tim.one at comcast.net
Sun Sep 21 10:48:21 EDT 2003
[Tim]
>> So what if MAL amended his suggestion to
>>
>> reject signed 2-byte wchar_t value as not-usable
>> +++++++ ?
[M.-A. Lemburg]
> That would not solve the problem.
Then what is the problem, specifically? I thought you agreed with Martin
that a signed 32-bit type doesn't hurt, since the sign bit then remains
clear for all Unicode data.
> Note that we have proper conversion routines that allow
> converting between wchar_t and Py_UNICODE. These routines must
> be used for conversions anyway (even if Py_UNICODE and wchar_t
> happen to be the same type), so from a programmer perspective
> changing Py_UNICODE to be unsigned won't be noticed and we
> don't lose anything much.
>
> Again, I don't see the point in using a signed type for data
> that doesn't have any concept of signed values. It's just
> bad design and we shouldn't try to go down the same route
> if we don't have to.
Why Martin favors wchar_t when possible isn't clear to me. Neither is why
there would be an intractable problem when wchar_t happens to be a signed
type wider than 2 bytes.
> The Unicode implementation has always defined Py_UNICODE to
> be an unsigned type; see the Unicode PEP 100:
>
> """
> Internal Format
>
> The internal format for Unicode objects should use a Python
> specific fixed format <PythonUnicode> implemented as 'unsigned
> short' (or another unsigned numeric type having 16 bits). Byte
> order is platform dependent.
>
> ...
>
> The configure script should provide aid in deciding whether
> Python can use the native wchar_t type or not (it has to be a
> 16-bit unsigned type).
> """
>
> Python can also deal with UCS4 now, but the concept remains the
> same.
Well, it doesn't have to be a 16-bit type either, even in a UCS2 build, and
we had a long argument about that one before, because a particular Cray
system didn't have any 16-bit type and the Unicode code wasn't working
there. That got repaired when I rewrote the few bits of code that assumed
"exactly 16 bits" to live with the weaker "at least 16 bits".
In this iteration, Martin agreed that a signed 16-bit wchar_t can be
rejected. The question remaining is what actual problem exists when there's
a signed wchar_t exceeding 16 bits. Since Jeremy is running on exactly such
a system, and the tests pass for him, there's no *obvious* problem with it
(the segfault he experienced was due to reading uninitialized memory, which
was a bug, and has since been fixed).