[ python-Bugs-1251300 ] Decoding with unicode_internal segfaults on UCS-4 builds

Thu Aug 18 22:17:28 CEST 2005

Bugs item #1251300, was opened at 2005-08-03 21:49
Message generated for change (Comment added) made by doerwalter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1251300&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: Nik Haldimann (nhaldimann)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Decoding with unicode_internal segfaults on UCS-4 builds

Initial Comment:
On UCS-4 builds, decoding a byte string with the
unicode_internal codec doesn't correctly work for code
points from 0x80000000 upwards and even segfaults. I
have observed the same behaviour on 2.5 from CVS and
2.4.0 on OS X/PowerPC as well as on 2.3.5 on Linux/x86.
Here's an example:

Python 2.5a0 (#1, Aug  3 2005, 21:34:05) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on
darwin
Type "help", "copyright", "credits" or "license" for
more information.
>>> "\x7f\xff\xff\xff".decode("unicode_internal")
u'\U7fffffff'
>>> "\x80\x00\x00\x00".decode("unicode_internal")
u'\x00'
>>> "\x80\x00\x00\x01".decode("unicode_internal")
u'\x01'
>>> "\x81\x00\x00\x00".decode("unicode_internal")
Segmentation fault

On little endian architectures the byte strings must be
reversed for the same effect.

I'm not sure if I understand what's going on, but I
guess there are 2 solution strategies:

1. Make unicode_internal work for any code point up to
0xFFFFFFFF.

2. Make unicode_internal raise a UnicodeDecodeError for
anything above 0x10FFFF (== sys.maxunicode for UCS-4
builds).

It seems like there are no unicode code points above
0x10FFFF, so the latter solution feels more correct to
me, even though it might break backwards compatibility
a tiny bit. The unicodeescape codec already does a
similar thing:

>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode
bytes in position 0-9: illegal Unicode character

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-18 22:17

Message:
Logged In: YES 
user_id=89016

The patch has a problem with input strings of a length that
is not a multiple of 4, e.g.
"\x00".decode("unicode-internal") returns u"" instead of
raising an error. Also in a UCS-2 build most of the tests
are irrelevant (as it's not possible to create codepoints
above 0x10ffff even when using surrogates), so probably they
should be ifdef'd out.

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 23:08

Message:
Logged In: YES 
user_id=1317086

Here's the patch with error handler support + test. Again:
Please review carefully.

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 18:35

Message:
Logged In: YES 
user_id=1317086

Ah, that PEP clears some things up for me. I will look into
it, but I hope you realize this requires tinkering with
unicodeobject.c since the error handler code seems to live
there.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-05 18:03

Message:
Logged In: YES 
user_id=89016

Your patch doesn't support PEP 293 error handlers. Could you
add support for that?

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 16:50

Message:
Logged In: YES 
user_id=1317086

OK, I put something together. Please review carefully as I'm
not very familiar with the C API. I have tested this with
the CVS HEAD on OS X and Linux.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-04 16:41

Message:
Logged In: YES 
user_id=38388

I think solution 2 is the right approach, since UCS-4 only
has 0x10FFFF possible code points.

Could you provide a patch ?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1251300&group_id=5470