[issue8941] utf-32be codec failing on UCS-2 python build for 32-bit value

Wed Jun 9 14:15:12 CEST 2010

Marc-Andre Lemburg <mal at egenix.com> added the comment:

Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou at free.fr> added the comment:
> 
> The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy when codec endianness doesn't match the native endianness (not to mention it could also crash if the underlying CPU arch doesn't support unaligned access to 4-byte integers):
> 
> #ifndef Py_UNICODE_WIDE
>     for (i = pairs = 0; i < size/4; i++)
>         if (((Py_UCS4 *)s)[i] >= 0x10000)
>             pairs++;
> #endif

Good catch!
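
To illustrate the problem, here is a minimal sketch (not the actual
patch) of a scan that assembles each code unit byte by byte. This
sidesteps both the endianness mismatch and the unaligned 4-byte
access. The names read_utf32 and count_pairs and the big_endian flag
are hypothetical and only serve the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: assemble one UTF-32 code unit from four bytes.
   big_endian != 0 selects UTF-32BE, otherwise UTF-32LE.  Reading
   single bytes works for any alignment of p. */
static uint32_t
read_utf32(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Count the code points that need a surrogate pair on a UCS-2 build.
   Trailing bytes that do not form a full code unit are ignored. */
static size_t
count_pairs(const unsigned char *s, size_t size, int big_endian)
{
    size_t i, pairs = 0;

    assert(s != NULL || size == 0);
    for (i = 0; i + 4 <= size; i += 4)
        if (read_utf32(s + i, big_endian) >= 0x10000)
            pairs++;
    return pairs;
}
```

With the bytes 00 01 00 00, the big-endian read gives 0x10000 (one
pair needed), while the little-endian read gives 0x100 (no pair).
That difference is exactly the miscount a native-order cast produces
when the codec byte order differs from the machine's.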

I wonder whether it would be better to preallocate
a Unicode object with a size of e.g. size/4 + 16 and
then resize the object as necessary whenever a surrogate
pair needs to be created (which won't happen that often
in practice).

The extra scan for pairs can take a long time, depending
on how much data you have to decode, and it is unlikely
to play well with CPU caches.
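
A rough sketch of the single-pass idea, under stated assumptions: a
plain malloc/realloc buffer of 16-bit units stands in for the Unicode
object, the input is taken to be big-endian, and no validation of the
code points is attempted. The function name decode_utf32be and the
growth policy are hypothetical; only the initial size/4 + 16 guess
comes from the suggestion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: decode big-endian UTF-32 into UTF-16 code units
   in one pass.  Returns a malloc'ed buffer (the caller frees it) and
   stores the number of units in *outlen; returns NULL if an allocation
   fails. */
static uint16_t *
decode_utf32be(const unsigned char *s, size_t size, size_t *outlen)
{
    size_t cap = size / 4 + 16;      /* optimistic guess plus slack */
    size_t n = 0, i;
    uint16_t *out;

    assert(outlen != NULL);
    out = malloc(cap * sizeof *out);
    if (out == NULL)
        return NULL;
    for (i = 0; i + 4 <= size; i += 4) {
        uint32_t ch = ((uint32_t)s[i] << 24) | ((uint32_t)s[i + 1] << 16) |
                      ((uint32_t)s[i + 2] << 8) | (uint32_t)s[i + 3];
        if (n + 2 > cap) {           /* grow only when the slack is used up */
            uint16_t *tmp;
            cap += cap / 2 + 16;
            tmp = realloc(out, cap * sizeof *tmp);
            if (tmp == NULL) {
                free(out);
                return NULL;
            }
            out = tmp;
        }
        if (ch >= 0x10000) {         /* rare case: emit a surrogate pair */
            ch -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (ch >> 10));
            out[n++] = (uint16_t)(0xDC00 | (ch & 0x3FF));
        }
        else
            out[n++] = (uint16_t)ch;
    }
    *outlen = n;
    return out;
}
```

The input is touched only once, and the buffer is reallocated only
after more surrogate pairs have shown up than the initial slack can
absorb, so typical input never pays for a resize.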

----------
title: utf-32be codec failing on UCS-2 python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit	value

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8941>
_______________________________________