[Python-bugs-list] [ python-Bugs-541828 ] Regression in unicodestr.encode()

Wed, 10 Apr 2002 15:22:04 -0700

Bugs item #541828, was opened at 2002-04-09 21:56
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=541828&group_id=5470

Category: Unicode
Group: Python 2.3
Status: Closed
Resolution: Fixed
Priority: 7
Submitted By: Barry Warsaw (bwarsaw)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Regression in unicodestr.encode()

Initial Comment:
I'm porting over the latest email package to Python 2.3cvs, and I've
had one of my tests fail.  I've narrowed it down to the following test
case:

a = u'\u6b63\u78ba\u306b\u8a00\u3046\u3068\u7ffb\u8a33\u306f\u3055\u308c\u3066\u3044\u307e\u305b\u3093\u3002\u4e00\u90e8\u306f\u30c9\u30a4\u30c4\u8a9e\u3067\u3059\u304c\u3001\u3042\u3068\u306f\u3067\u305f\u3089\u3081\u3067\u3059\u3002\u5b9f\u969b\u306b\u306f\u300cWenn ist das Nunstuck git und'
print repr(a.encode('utf-8', 'replace'))

In Python 2.2.1 I get

'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git und'

but in Python 2.3 cvs I get

'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git u\x00\x00'

Note that the last two characters, which should be `n' and `d', are now
NULs.  My very limited Tim-enlightened understanding is that encoding
a string to UTF-8 should never produce a string with NULs.
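
The expectation about NULs can be turned into a small check. The sketch
below is written for Python 3 (the report itself uses Python 2 syntax),
and the helper name utf8_has_unexpected_nul and the shortened sample
string are made up for illustration. It relies on the fact that UTF-8
emits a 0x00 byte only when the input contains U+0000 itself.

```python
def utf8_has_unexpected_nul(text):
    """Return True if the UTF-8 encoding of text contains a NUL byte
    that does not correspond to a U+0000 character in the input."""
    encoded = text.encode("utf-8")
    # Every U+0000 in the input becomes exactly one 0x00 byte; no other
    # character's UTF-8 form contains 0x00, so the two counts must match.
    return encoded.count(b"\x00") != text.count("\x00")


# Shortened stand-in for the string in the report (an assumption, not
# the full test case).
sample = "\u6b63\u78ba\u306b Wenn ist das Nunstuck git und"
assert not utf8_has_unexpected_nul(sample)
assert sample.encode("utf-8").endswith(b"git und")
```

A result ending in u\x00\x00 instead of und, as in the 2.3 cvs output
above, would fail the second assertion.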

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2002-04-10 18:22

Message:
Logged In: YES 
user_id=31435

Note that the debug-build pymalloc does catch the 
overwrite, and complains about it as soon as the fatal 
realloc is entered.  Unfortunately, the overwrite was so 
bad that it also destroyed the "serial number" info the 
debug pymalloc tried to display in its error report.

I agree Martin didn't introduce cbWritten (BTW, that kind 
of Hungarian naming is a sure sign that *someone* at 
Microsoft introduced it <wink>), but don't care where it 
came from.  What I do care about is that there weren't (and 
still aren't) asserts *verifying* that this delicate code 
isn't spilling over the allocated bounds.

About timing, last time we went around on this, 
the "measure once, cut once" version of the code was 
significantly slower in my timing tests too.  I don't care 
so much if the code is tricky, but the trickier the code 
the more asserts are required.

Note that pymalloc's realloc still doesn't give memory back 
when a small block is realloc'ed to a smaller size.  That 
makes the current method enjoy a speed advantage (at the 
expense of using more memory) in the usual cases today, but 
this special advantage may not persist.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-10 17:36

Message:
Logged In: YES 
user_id=21627

There is no bug in pymalloc. The codec wrote beyond the end
of the allocated buffer; this causes undefined behaviour.
The malloc implementation could not possibly know that the
data extends beyond the space it provided to the application.

Python 2.2 suffers from the same problem: If you have a
string of 10 characters, it will allocate 30 bytes. In UCS4
mode, if the first 6 characters each consume 4 bytes, this
will consume 24 bytes, leaving 6 bytes (resizing would only
be triggered if 4 bytes or less were left). Now, if the
remaining 4 characters each consume 2 bytes, the total size
written will be 32 bytes, causing a write into unallocated
memory by 2 bytes. So this is the same problem.
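
The arithmetic in the paragraph above can be reproduced with a short
Python 3 sketch. It only compares a 3-bytes-per-character estimate with
the real UTF-8 size; it does not model the codec's resize check, and the
helper name utf8_overrun and the two sample characters are assumptions
chosen for illustration.

```python
def utf8_overrun(text, bytes_per_char=3):
    """Bytes by which the real UTF-8 size exceeds an estimate of
    len(text) * bytes_per_char (0 if the estimate is large enough)."""
    estimate = len(text) * bytes_per_char
    needed = len(text.encode("utf-8"))
    return max(0, needed - estimate)


# 6 characters outside the BMP (4 bytes each in UTF-8) followed by
# 4 characters in the U+0080..U+07FF range (2 bytes each).
text = "\U00010000" * 6 + "\u00e9" * 4

print(len(text) * 3)              # 30 bytes estimated
print(len(text.encode("utf-8")))  # 32 bytes actually needed
print(utf8_overrun(text))         # 2 bytes past the estimate
```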

About cbWritten: it was introduced in unicodeobject.c 2.41,
where the checkin message says

  New surrogate support in the UTF-8 codec. By Bill Tutt.

So I'd challenge the claim that this is my doing.

As for computing the size in advance: Your arguments on
performance are not convincing, since your measurements were
flawed.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 16:50

Message:
Logged In: YES 
user_id=38388

Just confirmed: Python 2.2.1 definitely doesn't have
this problem.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 16:37

Message:
Logged In: YES 
user_id=38388

Fix checked in. Probably does not apply to the 2.2.1 branch
since this uses a different technique.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 14:53

Message:
Logged In: YES 
user_id=38388

I'm not in favour of the precomputation. We already had a
discussion about the performance of this.

About the cbWritten thingie: that was your invention, IIRC :-)
I'll try ripping that bit out again and use pointer arithmetic
instead.

Still, I believe the real cause of the problem is in pymalloc,
since a debugging session indicated that the codec did write
the 'n', 'd' characters. It's the final _PyString_Resize() which
causes these to be dropped during the copying of the
memory block.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-10 14:07

Message:
Logged In: YES 
user_id=21627

It appears that cbWritten can still run above cbAllocated,
namely if a long sequence of 3-byte characters is followed
by a long sequence of 1-byte or 2-byte characters.

I'm still in favour of dropping the resizing of the result
string, and computing the number of bytes in a first run.
The code becomes clearer that way and more performant; see
attached unicode.diff.
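
The idea of counting the bytes in a first pass and only then writing
them can be sketched in Python 3 as follows. This is not the attached
unicode.diff and not the C code in unicodeobject.c; the function names
are made up, and lone surrogates are not treated specially. The two
assertions check that no write goes past the precomputed size.

```python
def utf8_size(cp):
    """Number of UTF-8 bytes needed for code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def encode_utf8_two_pass(text):
    """Encode text as UTF-8 into a buffer sized exactly in a first pass."""
    # Pass 1: compute the exact output size, so no resize is ever needed.
    total = sum(utf8_size(ord(ch)) for ch in text)
    out = bytearray(total)
    pos = 0
    # Pass 2: fill the buffer.
    for ch in text:
        cp = ord(ch)
        n = utf8_size(cp)
        assert pos + n <= total, "write would spill past the buffer"
        if n == 1:
            out[pos] = cp
        elif n == 2:
            out[pos] = 0xC0 | (cp >> 6)
            out[pos + 1] = 0x80 | (cp & 0x3F)
        elif n == 3:
            out[pos] = 0xE0 | (cp >> 12)
            out[pos + 1] = 0x80 | ((cp >> 6) & 0x3F)
            out[pos + 2] = 0x80 | (cp & 0x3F)
        else:
            out[pos] = 0xF0 | (cp >> 18)
            out[pos + 1] = 0x80 | ((cp >> 12) & 0x3F)
            out[pos + 2] = 0x80 | ((cp >> 6) & 0x3F)
            out[pos + 3] = 0x80 | (cp & 0x3F)
        pos += n
    assert pos == total, "precomputed size and bytes written disagree"
    return bytes(out)


mixed = "a\u00e9\u6b63\U00010000 git und"
assert encode_utf8_two_pass(mixed) == mixed.encode("utf-8")
```

The trade-off discussed in this thread still applies: the first pass
costs an extra walk over the input, in exchange for a single allocation
of the right size.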

----------------------------------------------------------------------
