[Python-Dev] Re: [Python-checkins] python/dist/src/Objects unicodeobject.c, 2.197, 2.198
Tim Peters
tim.one at comcast.net
Sun Sep 21 10:48:21 EDT 2003
[Tim]
>> So what if MAL amended his suggestion to
>>
>> reject signed 2-byte wchar_t value as not-usable
>> +++++++ ?
[M.-A. Lemburg]
> That would not solve the problem.
Then what is the problem, specifically? I thought you agreed with Martin
that a signed 32-bit type doesn't hurt, since the sign bit then remains
clear for all Unicode data.
> Note that we have proper conversion routines that allow
> converting between wchar_t and Py_UNICODE. These routines must
> be used for conversions anyway (even if Py_UNICODE and wchar_t
> happen to be the same type), so from a programmer perspective
> changing Py_UNICODE to be unsigned won't be noticed and we
> don't lose anything much.
>
> Again, I don't see the point in using a signed type for data
> that doesn't have any concept of signed values. It's just
> bad design and we shouldn't try to go down the same route
> if we don't have to.
Why Martin favors wchar_t when possible isn't clear to me. Neither is why
there would be an intractable problem when wchar_t happens to be a signed
type wider than 2 bytes.
> The Unicode implementation has always defined Py_UNICODE to
> be an unsigned type; see the Unicode PEP 100:
>
> """
> Internal Format
>
> The internal format for Unicode objects should use a Python
> specific fixed format <PythonUnicode> implemented as 'unsigned
> short' (or another unsigned numeric type having 16 bits). Byte
> order is platform dependent.
>
> ...
>
> The configure script should provide aid in deciding whether
> Python can use the native wchar_t type or not (it has to be a
> 16-bit unsigned type).
> """
>
> Python can also deal with UCS4 now, but the concept remains the
> same.
Well, it doesn't have to be a 16-bit type either, even in a UCS2 build, and
we had a long argument about that one before, because a particular Cray
system didn't have any 16-bit type and the Unicode code wasn't working
there. That got repaired when I rewrote the few bits of code that assumed
"exactly 16 bits" to live with the weaker "at least 16 bits".
In this iteration, Martin agreed that a signed 16-bit wchar_t can be
rejected. The question remaining is what actual problem exists when there's
a signed wchar_t exceeding 16 bits. Since Jeremy is running on exactly such
a system, and the tests pass for him, there's no *obvious* problem with it
(the segfault he experienced was due to reading uninitialized memory, which
was a bug, and has since been fixed).