[Python-Dev] unicode vs buffer (array) design issue can crash interpreter

M.-A. Lemburg mal at egenix.com
Thu Apr 13 12:20:49 CEST 2006


Neal Norwitz wrote:
> On 3/31/06, M.-A. Lemburg <mal at egenix.com> wrote:
>> Martin v. Löwis wrote:
>>> Neal Norwitz wrote:
>>>> See http://python.org/sf/1454485 for the gory details.  Basically if
>>>> you create a unicode array (array.array('u')) and try to append an
>>>> 8-bit string (i.e., not unicode), you can crash the interpreter.
>>>>
>>>> The problem is that the string is converted without question to a
>>>> unicode buffer.  Within unicode, it assumes the data to be valid, but
>>>> this isn't necessarily the case.  We wind up accessing an array with a
>>>> negative index and boom.
>>> There are several problems combined here, which might need discussion:
>>>
>>> - why does the 'u#' converter use the buffer interface if available?
>>>   it should just support Unicode objects. The buffer object makes
>>>   no promise that the buffer actually is meaningful UCS-2/UCS-4, so
>>>   u# shouldn't guess that it is.
>>>   (FWIW, it currently truncates the buffer size to the next-smaller
>>>    multiple of sizeof(Py_UNICODE), and silently so)
>>>
>>>   I think that part should just go: u# should be restricted to unicode
>>>   objects.
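
For reference, the crashing pattern Neal describes boils down to the
following (a sketch, written so that it also runs on current
interpreters: on an unpatched 2.4 the second append could crash the
interpreter, while today it raises TypeError instead):

```python
import array

a = array.array('u')        # array of Py_UNICODE elements
a.append(u'x')              # appending a single unicode character is fine
try:
    a.append(b'ab')         # an 8-bit string: crashed 2.4, now a TypeError
except TypeError:
    print("non-unicode item rejected")
```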
>> 'u#' is intended to match 's#' which also uses the buffer
>> interface. It expects the buffer returned by the object
>> to a be a Py_UNICODE* buffer, hence the calculation of the
>> length.
>>
>> However, we already have 'es#' which is a lot safer to use
>> in this respect: you can explicitly define the encoding you
>> want to see, e.g. 'unicode-internal' and the associated
>> codec also takes care of range checks, etc.
>>
>> So, I'm +1 on restricting 'u#' to Unicode objects.
> 
> Note:  2.5 no longer crashes, 2.4 does.
> 
> Does this mean you would like to see this patch checked in to 2.5? 

Yes.

> What should we do about 2.4?

Perhaps you could add a warning that is displayed when
'u#' is used with non-Unicode objects?
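
At the Python level, the suggested behaviour could be sketched roughly
like this (the function name is hypothetical; the real check would live
in the 'u#' branch of getargs.c):

```python
import warnings

def convert_u_hash(arg):
    # Hypothetical sketch: keep accepting the argument for 2.4
    # compatibility, but warn when 'u#' receives a non-Unicode object.
    if not isinstance(arg, str):    # in 2.4 terms: not isinstance(arg, unicode)
        warnings.warn(
            "passing a non-Unicode buffer to 'u#' is unsafe",
            DeprecationWarning,
            stacklevel=2,
        )
    return arg

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    convert_u_hash(b"raw bytes")
print(len(caught), caught[0].category.__name__)   # 1 DeprecationWarning
```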

> Index: Python/getargs.c
> ===================================================================
> --- Python/getargs.c    (revision 45333)
> +++ Python/getargs.c    (working copy)
> @@ -1042,11 +1042,8 @@
>                                 STORE_SIZE(PyUnicode_GET_SIZE(arg));
>                         }
>                         else {
> -                       char *buf;
> -                       Py_ssize_t count = convertbuffer(arg, p, &buf);
> -                       if (count < 0)
> -                               return converterr(buf, arg, msgbuf, bufsize);
> -                       STORE_SIZE(count/(sizeof(Py_UNICODE)));
> +                               return converterr("cannot convert raw buffers",
> +                                                 arg, msgbuf, bufsize);
>                         }
>                         format++;
>                 } else {
> 
>>> - should Python guarantee that all characters in a Unicode object
>>>   are between 0 and sys.maxunicode? Currently, it is possible to
>>>   create Unicode strings with either negative or very large Py_UNICODE
>>>   elements.
>>>
>>> - if the answer to the last question is no (i.e. if it is intentional
>>>   that a unicode object can contain arbitrary Py_UNICODE values): should
>>>   Python then guarantee that Py_UNICODE is an unsigned type?
>> Py_UNICODE must always be unsigned. The whole implementation
>> relies on this and has been designed with this in mind (see
>> PEP 100). AFAICT, the configure script does check that
>> Py_UNICODE is always unsigned.
> 
> Martin fixed the crashing problem in 2.5 by making wchar_t unsigned;
> it had been signed because of a bug.  (A configure test was reversed,
> IIRC.)  Can this change to wchar_t be made in 2.4?  That technically
> changes all the interfaces, even though the signed type was a
> mistake.  What should be done for 2.4?

If users want to interface from wchar_t to Python's Unicode
type they have to go through the PyUnicode_FromWideChar()
and PyUnicode_AsWideChar() interfaces. Assuming that Py_UNICODE
is the same as wchar_t is simply wrong (and always was).
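
The same principle shows up in pure Python via ctypes, which always
performs an explicit conversion into a real wchar_t buffer instead of
reinterpreting Python's internal storage:

```python
import ctypes

# create_unicode_buffer() allocates a wchar_t[] and converts the string
# into it -- the Python-level analogue of PyUnicode_AsWideChar().
buf = ctypes.create_unicode_buffer(u"h\xe9llo")
print(buf.value)
print(ctypes.sizeof(ctypes.c_wchar))   # platform wchar_t size (2 or 4)
```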

I also think that changing the type from signed to unsigned
by backporting the configure fix will only make things safer
for the user, since extensions will probably not even be aware
of the fact that Py_UNICODE could be signed (it has always been
documented to be unsigned).
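
To illustrate why the signed case is dangerous: on a UCS-2 build, any
character above U+7FFF stored in a signed 16-bit Py_UNICODE reads back
negative, and using that value as a table index then indexes before the
table (the "negative index" Neal mentioned). With ctypes standing in
for the C types:

```python
import ctypes

code_point = 0xFFFD                            # U+FFFD REPLACEMENT CHARACTER
signed = ctypes.c_int16(code_point).value      # a signed 16-bit Py_UNICODE
unsigned = ctypes.c_uint16(code_point).value   # what PEP 100 assumes
print(signed, unsigned)                        # -3 65533
```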

So +1 on backporting the configure test fix to 2.4.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 13 2006)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

