[ python-Bugs-1251300 ] Decoding with unicode_internal segfaults on UCS-4 builds

Fri Aug 19 16:17:47 CEST 2005

Bugs item #1251300, was opened at 2005-08-03 21:49
Message generated for change (Comment added) made by nhaldimann
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1251300&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: Nik Haldimann (nhaldimann)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Decoding with unicode_internal segfaults on UCS-4 builds

Initial Comment:
On UCS-4 builds, decoding a byte string with the
unicode_internal codec doesn't correctly work for code
points from 0x80000000 upwards and even segfaults. I
have observed the same behaviour on 2.5 from CVS and
2.4.0 on OS X/PowerPC as well as on 2.3.5 on Linux/x86.
Here's an example:

Python 2.5a0 (#1, Aug  3 2005, 21:34:05) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on
darwin
Type "help", "copyright", "credits" or "license" for
more information.
>>> "\x7f\xff\xff\xff".decode("unicode_internal")
u'\U7fffffff'
>>> "\x80\x00\x00\x00".decode("unicode_internal")
u'\x00'
>>> "\x80\x00\x00\x01".decode("unicode_internal")
u'\x01'
>>> "\x81\x00\x00\x00".decode("unicode_internal")
Segmentation fault

On little endian architectures the byte strings must be
reversed for the same effect.

I'm not sure if I understand what's going on, but I
guess there are 2 solution strategies:

1. Make unicode_internal work for any code point up to
0xFFFFFFFF.

2. Make unicode_internal raise a UnicodeDecodeError for
anything above 0x10FFFF (== sys.maxunicode for UCS-4
builds).

It seems like there are no unicode code points above
0x10FFFF, so the latter solution feels more correct to
me, even though it might break backwards compatibility
a tiny bit. The unicodeescape codec already does a
similar thing:

>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode
bytes in position 0-9: illegal Unicode character

----------------------------------------------------------------------

>Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-19 16:17

Message:
Logged In: YES 
user_id=1317086

I agree about the ifdefs. I'm not sure about how to handle
input strings of incorrect length. I guess raising an
UnicodeDecodeError is in order. But I think it doesn't make
sense to let it pass through the error handler, since the
data the handler would see is potentially nonsensical (e.g.,
the code point value). Can you comment on this? Is it ok to
raise a UnicodeDecodeError and skip the error handler here?

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-18 22:17

Message:
Logged In: YES 
user_id=89016

The patch has a problem with input strings of a length that
is not a multiple of 4, e.g.
"\x00".decode("unicode-internal") returns u"" instead of
raising an error. Also in a UCS-2 build most of the tests
are irrelevant (as it's not possible to create codepoints
above 0x10ffff even when using surrogates), so probably they
should be ifdef'd out.

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 23:08

Message:
Logged In: YES 
user_id=1317086

Here's the patch with error handler support + test. Again:
Please review carefully.

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 18:35

Message:
Logged In: YES 
user_id=1317086

Ah, that PEP clears some things up for me. I will look into
it, but I hope you realize this requires tinkering with
unicodeobject.c since the error handler code seems to live
there.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-05 18:03

Message:
Logged In: YES 
user_id=89016

Your patch doesn't support PEP 293 error handlers. Could you
add support for that?

----------------------------------------------------------------------

Comment By: Nik Haldimann (nhaldimann)
Date: 2005-08-05 16:50

Message:
Logged In: YES 
user_id=1317086

OK, I put something together. Please review carefully as I'm
not very familiar with the C API. I have tested this with
the CVS HEAD on OS X and Linux.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-04 16:41

Message:
Logged In: YES 
user_id=38388

I think solution 2 is the right approach, since UCS-4 only
has 0x10FFFF possible code points.

Could you provide a patch ?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1251300&group_id=5470