[Python-Dev] Bug in PyLocale_strcoll

Mon Nov 22 12:46:01 CET 2004

"M.-A. Lemburg" <mal at egenix.com> writes:

> You're right: they are always 0-terminated just like 8-bit strings
> and even though it doesn't seem to be necessary since Python
> functions will always use the size field when working on
> a Unicode object rather than rely on the 0-termination.

OK, should be documented in the code

>> Ok... I'm still not sure if I should file a bug for PyLocale_strcoll
>> or PyUnicode_AsWideChar and if the patch for the latter should assume
>> that the unicode string buffer is 0-terminated...
>
> I think it's probably wise to fix both:
>
> Looking again, the patch we applied to PyUnicode_AsWideChar()
> only fixes the 0-termination problem in the case where
> HAVE_USABLE_WCHAR_T is set. This should be extended to
> the memcpy() as well.

What I read from the code is that now in both cases the string is
copied without 0 and that is consistent with the size the buffer is
checked for (PyUnicode_GET_SIZE gives the value of the length field
and that doesn't include the 0-termination)

> Still, if the buffer passed to PyUnicode_AsWideChar()
> is not big enough, you won't get the 0-termination (due
> to truncation), so PyLocale_strcoll() must be either very
> careful to allocate a buffer that is always big enough
> or apply 0-termination itself.

PyLocale_strcoll() acts quite careful but even so it didn't get what
it expected ;-). This bug is masked by the bug you referred to when
the copy loop is used (ie. if wchar sizes don't match) and the output
buffer string is big enough (like in the strcoll case because the
buffer size already accounts for the 0-termination).

I appended a (untested) patch for unicodeobject.c.

The documentation should be clarified too. Would a patch against
concrete.tex be accepted where I change

- 'Unicode object' to 'Unicode string' when only the string part of
  the python object is referenced,

- 'size of the object' to 'length of the string'

- mention the 0-termination of the return-value of
  PyUnicode_AS_UNICODE()

- mention the 0-termination of the return-value of
  PyUnicode_AsWideChar

- '... represents a 16-bit...' to something that explains 16 vs. 32
  but depending on internal representation (UCS-2 or UCS-4) selected at
  compile time

--- unicodeobject.c-2.229	Mon Nov 22 10:58:42 2004
+++ unicodeobject.c	Mon Nov 22 11:10:07 2004
@@ -144,8 +144,7 @@
         return -1;
     }
 
-    /* We allocate one more byte to make sure the string is
-       Ux0000 terminated -- XXX is this needed ? */
+    /* We allocate one more Py_UNICODE and make the string Ux0000 terminated */
     oldstr = unicode->str;
     PyMem_RESIZE(unicode->str, Py_UNICODE, length + 1);
     if (!unicode->str) {
@@ -167,8 +166,7 @@
     return 0;
 }
 
-/* We allocate one more byte to make sure the string is
-   Ux0000 terminated -- XXX is this needed ?
+/* We allocate length+1 and make the string Ux0000 terminated
 
    XXX This allocator could further be enhanced by assuring that the
        free list never reduces its size below 1.
@@ -384,8 +382,10 @@
 	PyErr_BadInternalCall();
 	return -1;
     }
-    if (size > PyUnicode_GET_SIZE(unicode))
-	size = PyUnicode_GET_SIZE(unicode);
+    /* copy all chars including 0-termination if the output buffer
+       size is sufficient */
+    if (size > PyUnicode_GET_SIZE(unicode)+1)
+	size = PyUnicode_GET_SIZE(unicode)+1;
 #ifdef HAVE_USABLE_WCHAR_T
     memcpy(w, unicode->str, size * sizeof(wchar_t));
 #else