[issue5127] UnicodeEncodeError - I can't even see license

Mon Oct 5 13:41:17 CEST 2009

Marc-Andre Lemburg <mal at egenix.com> added the comment:

Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa at gmail.com> added the comment:
> 
>> we should make sure that it's not possible to load an extension
>> compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.
> 
> This is the case with this patch: today all these functions
> (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to
> _PyUnicodeUCS2_* or _PyUnicodeUCS4_*.
> The patch removes the #defines: 3.1 modules that call
> _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

True, but we can do better. On narrow builds, the API currently
exposes only the UCS2 functions. We'd need to expose the UCS4 functions
*in addition* and have the UCS2 ones redirect to them.

For wide builds, we don't need to change anything.
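
As a rough illustration of that redirect on a narrow build, here is a
minimal C sketch. All names (demo_ucs2, demo_ucs4, demo_UCS2_IsAlpha,
demo_UCS4_IsAlpha) are hypothetical stand-ins, not the actual CPython
symbols, and the classification logic is a placeholder rather than the
real type database lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; these are not the CPython symbols. */
typedef uint32_t demo_ucs4;   /* can hold any code point */
typedef uint16_t demo_ucs2;   /* one 16-bit unit, as on a narrow build */

/* The single real implementation, always taking a full code point.
   Placeholder logic: only ASCII letters and the Deseret block
   (U+10400..U+1044F) are treated as alphabetic here. */
int demo_UCS4_IsAlpha(demo_ucs4 ch)
{
    assert(ch <= 0x10FFFF);
    if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
        return 1;
    return ch >= 0x10400 && ch <= 0x1044F;
}

/* Entry point kept for existing narrow-build callers:
   it widens its argument and forwards to the UCS4 function. */
int demo_UCS2_IsAlpha(demo_ucs2 ch)
{
    return demo_UCS4_IsAlpha((demo_ucs4)ch);
}
```

With this layout, callers of the 16-bit entry point keep working, while
new code can pass a non-BMP code point such as 0x10400 to the 32-bit
entry point directly.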

>> The change affects the Unicode type database which is implemented
>> in unicodectype.c, not the Unicode database, which already uses UCS4.
> 
> Are you referring to the _PyUnicode_TypeRecord structure?
> The first three fields only contain values up to 65535, so they could
> use "unsigned short" even for UCS4 builds.

I haven't checked, but it's certainly possible to have a code point
use a non-BMP lower/upper/title case mapping, so this should be
made possible as well, if we're going to make changes to the type
database.
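
One concrete case: U+10400 (DESERET CAPITAL LETTER LONG I) has the
lowercase mapping U+10428, and both lie outside the BMP. The C sketch
below uses a hypothetical record (not the real _PyUnicode_TypeRecord
layout) and assumes the field stores the mapped code point itself. The
message does not say whether the database stores absolute values or
offsets; an offset (0x28 here) would still fit in 16 bits.

```c
#include <stdint.h>

/* Hypothetical record, not the real _PyUnicode_TypeRecord layout. */
typedef struct {
    uint16_t lower16;   /* 16-bit field, like "unsigned short" */
    uint32_t lower32;   /* field wide enough for any code point */
} demo_case_record;

/* Store the same lowercase mapping in both field widths. */
demo_case_record demo_store_lower(uint32_t lower)
{
    demo_case_record rec;
    rec.lower16 = (uint16_t)lower;  /* keeps only the low 16 bits */
    rec.lower32 = lower;
    return rec;
}
```

Storing 0x10428 this way leaves 0x0428 in the 16-bit field, which is a
different (BMP) code point, while the 32-bit field keeps the full value.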

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5127>
_______________________________________