[Python-Dev] Bug in PyLocale_strcoll

Mon Nov 22 14:03:23 CET 2004

Andreas Degert wrote:
> "M.-A. Lemburg" <mal at egenix.com> writes:
> 
> 
>>You're right: they are always 0-terminated just like 8-bit strings,
>>even though that doesn't seem to be necessary, since Python
>>functions will always use the size field when working on
>>a Unicode object rather than rely on the 0-termination.
> 
> 
> OK, should be documented in the code

It is, but I wasn't sure whether it is really such a good
idea to waste the extra memory and wanted to keep the option
of removing the 0-termination.

>>>Ok... I'm still not sure if I should file a bug for PyLocale_strcoll
>>>or PyUnicode_AsWideChar and if the patch for the latter should assume
>>>that the unicode string buffer is 0-terminated...
>>
>>I think it's probably wise to fix both:
>>
>>Looking again, the patch we applied to PyUnicode_AsWideChar()
>>only fixes the 0-termination problem in the case where
>>HAVE_USABLE_WCHAR_T is set. This should be extended to
>>the memcpy() as well.
> 
> 
> What I read from the code is that now in both cases the string is
> copied without 0 and that is consistent with the size the buffer is
> checked for (PyUnicode_GET_SIZE gives the value of the length field
> and that doesn't include the 0-termination)
> 
> 
>>Still, if the buffer passed to PyUnicode_AsWideChar()
>>is not big enough, you won't get the 0-termination (due
>>to truncation), so PyLocale_strcoll() must be either very
>>careful to allocate a buffer that is always big enough
>>or apply 0-termination itself.
> 
> 
> PyLocale_strcoll() acts quite carefully but even so it didn't get what
> it expected ;-). This bug is masked by the bug you referred to when
> the copy loop is used (ie. if wchar sizes don't match) and the output
> buffer string is big enough (like in the strcoll case because the
> buffer size already accounts for the 0-termination).
> 
> I appended an (untested) patch for unicodeobject.c.

I've just checked in a patch which should correct the
problem.
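
For illustration only, here is a minimal sketch of the caller-side
defence discussed above: allocate one extra element and apply the
0-termination yourself instead of relying on the copy routine. This is
not the checked-in patch. copy_wide() and dup_wide_terminated() are
hypothetical names, and copy_wide() is only an assumed stand-in for a
truncating copy that never writes a terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for a truncating copy: copies at most `size`
   wide characters from `src` (which holds `len` characters, terminator
   not counted) and returns the number copied.  It never writes a
   terminating 0 itself. */
static size_t copy_wide(const wchar_t *src, size_t len,
                        wchar_t *dst, size_t size)
{
    size_t n = len < size ? len : size;
    wmemcpy(dst, src, n);
    return n;
}

/* Caller-side defence: allocate len + 1 elements and terminate
   explicitly, so that wcscoll() always sees a 0-terminated string.
   Returns NULL if the allocation fails; the caller frees the result. */
static wchar_t *dup_wide_terminated(const wchar_t *src, size_t len)
{
    wchar_t *buf = malloc((len + 1) * sizeof(wchar_t));
    size_t n;

    if (buf == NULL)
        return NULL;
    n = copy_wide(src, len, buf, len);
    buf[n] = L'\0';   /* do not rely on the copy to terminate */
    return buf;
}
```

With a helper like this, the buffer handed to wcscoll() is terminated
no matter what the copy routine does on truncation.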

> The documentation should be clarified too. Would a patch against
> concrete.tex be accepted where I change
> 
> - 'Unicode object' to 'Unicode string' when only the string part of
>   the python object is referenced,

Not sure what you mean here.

> - 'size of the object' to 'length of the string'

Ditto.

> - mention the 0-termination of the return-value of
>   PyUnicode_AS_UNICODE()
> 
> - mention the 0-termination of the return-value of
>   PyUnicode_AsWideChar

I don't think we should document this. Programmers should always
use the size of the object rather than rely on the 0-termination.
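
A rough sketch of why the size is the safer thing to rely on: a
buffer may contain an embedded 0, in which case a terminator-based
length stops early while a size-based loop sees every character. The
counted_wstr struct and count_char() are hypothetical and only model
a buffer with a separate length field; they are not part of the
Python C API.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical counted string: `length` is authoritative, in the same
   way the size field of a Unicode object is. */
typedef struct {
    const wchar_t *data;
    size_t length;
} counted_wstr;

/* Count occurrences of `ch` using the stored length, never the
   terminator, so embedded 0 characters do not cut the scan short. */
static size_t count_char(counted_wstr s, wchar_t ch)
{
    size_t i, hits = 0;

    for (i = 0; i < s.length; i++) {
        if (s.data[i] == ch)
            hits++;
    }
    return hits;
}
```

For a buffer holding 'a', 0, 'b', 'a' with length 4, count_char()
finds both 'a' characters, whereas wcslen() on the same buffer stops
at the embedded 0.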

> - '... represents a 16-bit...' to something that explains 16 vs. 32
>   bit, depending on the internal representation (UCS-2 or UCS-4)
>   selected at compile time

+1

-- 
Marc-Andre Lemburg
eGenix.com
