[Patches] [ python-Patches-1454485 ] patch for SIGSEGV in arraymodule.c

Mon Jun 5 12:46:34 CEST 2006

Patches item #1454485, was opened at 2006-03-20 13:44
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1454485&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Core (C code)
Group: Python 2.4
>Status: Closed
>Resolution: Fixed
Priority: 7
Submitted By: Baris Metin (tbmetin)
Assigned to: Martin v. LÃ¶wis (loewis)
Summary: patch for SIGSEGV in arraymodule.c

Initial Comment:
Array module fails handling utf-8 strings giving a  
SIGSEGV. Attached patch seems to do the trick... 

gdb> run   
(no debugging symbols found)   
(no debugging symbols found)   
[Thread debugging using libthread_db enabled]   
[New Thread -1480337216 (LWP 31303)]   
Python 2.4.2 (#1, Mar 20 2006, 12:08:06)   
[GCC 3.4.5] on linux2   
Type "help", "copyright", "credits" or "license" for   
more information.   
>>> import array   
>>> x = array.array("u")   
>>> x.append(u"barÄ±ÅŸ")   
Traceback (most recent call last):   
  File "<stdin>", line 1, in ?   
TypeError: array item must be unicode character   
>>> x.append("barÄ±ÅŸ")   
>>> x   

Program received signal SIGSEGV, Segmentation fault.   
[Switching to Thread -1480337216 (LWP 31303)]   
Error while running hook_stop:   
Invalid type combination in ordering comparison.   
0xa7ee0799 in PyUnicodeUCS4_FromUnicode ()   
from /usr/lib/libpython2.4.so.1.0   

----------------------------------------------------------------------

>Comment By: Martin v. LÃ¶wis (loewis)
Date: 2006-06-05 12:46

Message:
Logged In: YES 
user_id=21627

I added a work-around for the crash in r46669:
PyUnicode_FromUnicode now only uses the shared characters
for non-negative values. Every other patch I thought of
would have broken backwards compatibility in some way.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-04-14 07:22

Message:
Logged In: YES 
user_id=33168

I removed the code which tried to convert a buffer (string)
to unicode.

Committed revision 45374.

We still have to address 2.4.  Martin, I can make the fix if
you tell me what to do.

----------------------------------------------------------------------

Comment By: Martin v. LÃ¶wis (loewis)
Date: 2006-04-11 09:16

Message:
Logged In: YES 
user_id=21627

It turns out that the test whether wchar_t is unsigned was
reversed. Correcting the test stops makes the program not
crash anymore; this is fixed in r45264

Not sure whether this can be backported to 2.4; it will mean
that Python will stop using wchar_t as Py_UNICODE on Linux.

----------------------------------------------------------------------

Comment By: Martin v. LÃ¶wis (loewis)
Date: 2006-03-29 20:13

Message:
Logged In: YES 
user_id=21627

I think there is a design choice here: Should it be an
invariant that all characters in a Unicode string are below
sys.maxunicode? If so, the procedures to create them should
check whether that invariant holds. If not, all code that
deals with them most consider cases where they are larger
than sys.maxunicode or smaller than 0.

Also, should it be a requirement that Py_UNICODE is an
unsigned type? Then tests for <0 could be dropped, of course.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-03-29 09:52

Message:
Logged In: YES 
user_id=33168

Attached is an updated patch to only do the (unsigned) cast
in unicodeobject.c.  The test included in the patch still
crashes  the interpreter, this time in unicodectype.c.

----------------------------------------------------------------------

Comment By: Martin v. LÃ¶wis (loewis)
Date: 2006-03-26 23:36

Message:
Logged In: YES 
user_id=21627

The second part of the patch (checking that *u is not
negative) is definitely right.

The first part (requiring an even number of bytes for a u#
argument) probably requires discussion on python-dev (or
this patch should be assigned to MAL): I don't think it
should be allowed to pass a non-Unicode object to u# in the
first place.

In particular, if you pass a byte string, there would be an
implicit assumption that the byte encoding is the same
internal representation as a Py_UNICODE. This is bad -
Python normally assumes the encoding of a string is the
system encoding, which normally is ASCII.

Of course, changing the call to a type error for 2.4.3
probably won't work, either, because it might break existing
code.

Anyway, I believe the latter fix alone should fix the crash:
the current getargs implementation will round down to the
next multiple of sizeof(Py_UNICODE), thanks the integer
division. u_setitem will then refuse the call if the length
is not 1. IOW, it is possible to append between 4 and 7
bytes to a Unicode array.

I wonder why the patch fixes the problem: *u should be an
unsigned, and comparing an unsigned with a signed should
convert the signed to unsigned, no?

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-03-25 08:29

Message:
Logged In: YES 
user_id=33168

Verified for 2.4 and head.  The probably could exist w/ucs2
also if you use 'bar' (I think).

I agree with Nick, this patch doesn't really solve the
problem.  The attached patch fixes the crash more generally,
but I'm think there is a better solution.  I hope Martin has
time to review this and suggest a better fix.

Martin, the change in getargs ensures that we don't try to
convert an 8-bit string of length 5 to unicode.  The change
in unicodeobject ensures that we don't reference the array
with a negative offset as happens if the buffer conversion
succeeds with an invalid unicode character.

----------------------------------------------------------------------

Comment By: Nick Coghlan (ncoghlan)
Date: 2006-03-24 15:43

Message:
Logged In: YES 
user_id=1038590

To get the effect of the patch, it should be sufficient to
just change the format character to an uppercase U.

That doesn't seem like the right fix though - the actual
explosion isn't happening until later when the array
elements are being converted to Unicode for output.

----------------------------------------------------------------------

Comment By: Baris Metin (tbmetin)
Date: 2006-03-24 10:19

Message:
Logged In: YES 
user_id=1045504

I'm able to reproduce the bug with 2.5a0 SVN (r43289).   

Please try with --enable-unicode=ucs4 

The patch is against svn too.. 

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-03-24 07:11

Message:
Logged In: YES 
user_id=33168

With the stock 2.4.2 on my linux box I was able to reproduce
this.  I couldn't reproduce with 2.4.3c1.  Can you verify
this is fixed in 2.4.3?

Sagol.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1454485&group_id=5470