[ python-Bugs-1251300 ] Decoding with unicode_internal segfaults on UCS-4 builds

Thu Aug 4 16:41:23 CEST 2005

Bugs item #1251300, was opened at 2005-08-03 21:49
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1251300&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: nhaldimann (nhaldimann)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Decoding with unicode_internal segfaults on UCS-4 builds

Initial Comment:
On UCS-4 builds, decoding a byte string with the
unicode_internal codec does not work correctly for
values from 0x80000000 upwards and can even segfault. I
have observed the same behaviour on 2.5 from CVS and
2.4.0 on OS X/PowerPC as well as on 2.3.5 on Linux/x86.
Here's an example:

Python 2.5a0 (#1, Aug  3 2005, 21:34:05) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\x7f\xff\xff\xff".decode("unicode_internal")
u'\U7fffffff'
>>> "\x80\x00\x00\x00".decode("unicode_internal")
u'\x00'
>>> "\x80\x00\x00\x01".decode("unicode_internal")
u'\x01'
>>> "\x81\x00\x00\x00".decode("unicode_internal")
Segmentation fault

On little-endian architectures the byte strings must be
reversed to get the same effect.
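
As a rough illustration of the byte reversal, the test strings for
either byte order could be built with the struct module. This is
only a sketch: the helper name ucs4_unit_bytes is hypothetical and
is not part of Python or of the codec.

```python
import struct
import sys


def ucs4_unit_bytes(value, byteorder=sys.byteorder):
    """Return the 4 bytes of one 32-bit unit in the given byte order.

    Hypothetical helper for building test input; not a Python API.
    """
    fmt = ">I" if byteorder == "big" else "<I"
    return struct.pack(fmt, value)


# The four values used in the session above, in both byte orders.
for value in (0x7FFFFFFF, 0x80000000, 0x80000001, 0x81000000):
    big = ucs4_unit_bytes(value, "big")
    little = ucs4_unit_bytes(value, "little")
    print("%08X  big=%r  little=%r" % (value, big, little))
```

Passing the string for the machine's own byte order to
.decode("unicode_internal") should then exercise the same code path
as in the session above.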

I'm not sure I understand what's going on, but I
see two possible solutions:

1. Make unicode_internal work for any code point up to
0xFFFFFFFF.

2. Make unicode_internal raise a UnicodeDecodeError for
anything above 0x10FFFF (== sys.maxunicode for UCS-4
builds).
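
To make strategy 2 concrete, here is a minimal pure-Python sketch of
the range check. It is not the codec's actual implementation (that
lives in C); the function name decode_ucs4_checked and the error
messages are assumptions made for illustration, and the function
returns a list of integers rather than a unicode object.

```python
import struct
import sys

MAX_CODE_POINT = 0x10FFFF  # equals sys.maxunicode on UCS-4 builds


def decode_ucs4_checked(data, byteorder=sys.byteorder):
    """Read 4-byte units and reject any value above 0x10FFFF.

    Hypothetical sketch of strategy 2; returns a list of code points.
    """
    fmt = ">I" if byteorder == "big" else "<I"
    if len(data) % 4:
        start = len(data) - len(data) % 4
        raise UnicodeDecodeError("unicode_internal", data, start,
                                 len(data), "truncated input")
    code_points = []
    for offset in range(0, len(data), 4):
        (value,) = struct.unpack(fmt, data[offset:offset + 4])
        if value > MAX_CODE_POINT:
            # Fail cleanly instead of building an invalid string.
            raise UnicodeDecodeError("unicode_internal", data, offset,
                                     offset + 4, "illegal code point")
        code_points.append(value)
    return code_points


try:
    decode_ucs4_checked(b"\x81\x00\x00\x00", "big")
except UnicodeDecodeError as exc:
    print("rejected: %s" % exc)
```

With a check like this, the inputs from the session above would raise
an exception instead of being silently truncated or crashing.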

It seems like there are no Unicode code points above
0x10FFFF, so the latter solution feels more correct to
me, even though it might break backwards compatibility
a tiny bit. The unicodeescape codec already does a
similar thing:

>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode
bytes in position 0-9: illegal Unicode character

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-04 16:41

Message:

I think solution 2 is the right approach, since Unicode code
points only go up to 0x10FFFF, even on UCS-4 builds.

Could you provide a patch?

----------------------------------------------------------------------
