[Python-bugs-list] [ python-Bugs-541828 ] Regression in unicodestr.encode()
noreply@sourceforge.net
Wed, 10 Apr 2002 15:22:04 -0700
Bugs item #541828, was opened at 2002-04-09 21:56
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=541828&group_id=5470
Category: Unicode
Group: Python 2.3
Status: Closed
Resolution: Fixed
Priority: 7
Submitted By: Barry Warsaw (bwarsaw)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Regression in unicodestr.encode()
Initial Comment:
I'm porting over the latest email package to Python 2.3cvs, and I've
had one of my tests fail. I've narrowed it down to the following test
case:
a = u'\u6b63\u78ba\u306b\u8a00\u3046\u3068\u7ffb\u8a33\u306f\u3055\u308c\u3066\u3044\u307e\u305b\u3093\u3002\u4e00\u90e8\u306f\u30c9\u30a4\u30c4\u8a9e\u3067\u3059\u304c\u3001\u3042\u3068\u306f\u3067\u305f\u3089\u3081\u3067\u3059\u3002\u5b9f\u969b\u306b\u306f\u300cWenn ist das Nunstuck git und'
print repr(a.encode('utf-8', 'replace'))
In Python 2.2.1 I get
'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git und'
but in Python 2.3 cvs I get
'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git u\x00\x00'
Note that the last two characters, which should be `n' and `d', are
now NULs. My very limited Tim-enlightened understanding is that
encoding a string to UTF-8 should never produce a string with NULs.
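The expectation about NULs can be checked with a small sketch. It is
written for Python 3 (the report itself uses Python 2 syntax) and uses a
shortened version of the reported string; the variable names are
illustrative only. Every byte of a multi-byte UTF-8 sequence has its
high bit set, so a 0x00 byte can only appear when the input itself
contains U+0000.

```python
# Python 3 sketch; "sample" is a shortened form of the string in the report.
sample = "\u6b63\u78ba\u306b\u300cWenn ist das Nunstuck git und"
encoded = sample.encode("utf-8", "replace")

# A correct encoder emits no NUL bytes for input without U+0000 ...
has_nul = b"\x00" in encoded
# ... and the trailing ASCII characters survive unchanged.
tail_intact = encoded.endswith(b"git und")
# Decoding must give back the original text.
round_trips = encoded.decode("utf-8") == sample

print(has_nul, tail_intact, round_trips)
```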
----------------------------------------------------------------------
>Comment By: Tim Peters (tim_one)
Date: 2002-04-10 18:22
Message:
Logged In: YES
user_id=31435
Note that the debug-build pymalloc does catch the
overwrite, and complains about it as soon as the fatal
realloc is entered. Unfortunately, the overwrite was so
bad that it also destroyed the "serial number" info the
debug pymalloc tried to display in its error report.
I agree Martin didn't introduce cbWritten (BTW, that kind
of Hungarian naming is a sure sign that *someone* at
Microsoft introduced it <wink>), but don't care where it
came from. What I do care about is that there weren't (and
still aren't) asserts *verifying* that this delicate code
isn't spilling over the allocated bounds.
About timing, last time we went around on this,
the "measure once, cut once" version of the code was
significantly slower in my timing tests too. I don't care
so much if the code is tricky, but the trickier the code
the more asserts are required.
Note that pymalloc's realloc still doesn't give memory back
when a small block is realloc'ed to a smaller size. That
makes the current method enjoy a speed advantage (at the
expense of using more memory) in the usual cases today, but
this special advantage may not persist.
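
As an illustration of the kind of bounds-verifying check asked for
above, here is a minimal Python sketch. It is not the codec's C code;
the class name CheckedBuffer and its methods are hypothetical, and it
only models a fixed-size allocation that refuses any write past its
end.

```python
class CheckedBuffer:
    """Hypothetical fixed-size output buffer that verifies every write."""

    def __init__(self, allocated):
        self.data = bytearray(allocated)  # stands in for the allocated block
        self.written = 0                  # bytes written so far

    def write(self, chunk):
        # The check: never let the write position pass the allocation.
        if self.written + len(chunk) > len(self.data):
            raise AssertionError(
                "write of %d bytes at offset %d exceeds %d allocated"
                % (len(chunk), self.written, len(self.data))
            )
        self.data[self.written:self.written + len(chunk)] = chunk
        self.written += len(chunk)


# Usage: 30 bytes allocated, six 4-byte sequences, then 2-byte sequences.
buf = CheckedBuffer(30)
for _ in range(6):
    buf.write(b"\xf0\x90\x80\x80")  # U+10000 in UTF-8
overrun_caught = False
try:
    for _ in range(4):
        buf.write(b"\xc3\xa9")      # U+00E9 in UTF-8
except AssertionError:
    overrun_caught = True
```

With a check like this the overrun is reported at the offending write
rather than showing up later as corrupted output.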
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-04-10 17:36
Message:
Logged In: YES
user_id=21627
There is no bug in pymalloc. The codec wrote beyond the end
of the allocated buffer, which causes undefined behaviour.
The malloc implementation could not possibly know that the
data extends beyond the space it provided to the application.
Python 2.2 suffers from the same problem: for a string of 10
characters, it allocates 30 bytes. In UCS4 mode, if the first
6 characters each consume 4 bytes, 24 bytes are used, leaving
6 bytes (resizing is only triggered when 4 bytes or fewer are
left). Now, if the remaining 4 characters each consume 2 bytes,
the total size written will be 32 bytes, which writes 2 bytes
past the end of the allocated memory. So this is the same problem.
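
The arithmetic of this scenario can be replayed in a few lines of
Python. This is only a sketch of the numbers quoted above, not of the
codec itself: the 3-bytes-per-character estimate and the 4-byte resize
threshold are taken from the description, the names are made up for
illustration, and the resize check is evaluated only at the single
point the description mentions.

```python
BYTES_PER_CHAR_ESTIMATE = 3  # 10 characters -> 30 bytes, as described
RESIZE_THRESHOLD = 4         # resize only if this many bytes or fewer remain

widths = [4] * 6 + [2] * 4   # six 4-byte characters, then four 2-byte ones
allocated = len(widths) * BYTES_PER_CHAR_ESTIMATE

# After the six wide characters, 6 bytes remain, so no resize is triggered.
remaining_after_wide = allocated - sum(widths[:6])
resize_triggered = remaining_after_wide <= RESIZE_THRESHOLD

# The four 2-byte characters then need 8 bytes, 2 more than are left.
written = sum(widths)
spill = max(0, written - allocated)

print(allocated, remaining_after_wide, resize_triggered, written, spill)
```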
About cbWritten: it was introduced in unicodeobject.c 2.41,
where the checkin message says
New surrogate support in the UTF-8 codec. By Bill Tutt.
So I'd challenge the claim that this is my doing.
As for computing the size in advance: Your arguments on
performance are not convincing, since your measurements were
flawed.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 16:50
Message:
Logged In: YES
user_id=38388
Just confirmed: Python 2.2.1 definitely doesn't have
this problem.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 16:37
Message:
Logged In: YES
user_id=38388
Fix checked in. Probably does not apply to the 2.2.1 branch
since this uses a different technique.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 14:53
Message:
Logged In: YES
user_id=38388
I'm not in favour of the precomputation. We already had a
discussion about the performance of this.
About the cbWritten thingie: that was your invention, IIRC :-)
I'll try ripping that bit out again and use pointer arithmetic
instead.
Still, I believe the real cause of the problem is in pymalloc,
since a debugging session indicated that the codec did write
the 'n', 'd' characters. It's the final _PyString_Resize() which
causes these to be dropped during the copying of the
memory block.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-04-10 14:07
Message:
Logged In: YES
user_id=21627
It appears that cbWritten can still run above cbAllocated,
namely if a long sequence of 3-byte characters is followed
by a long sequence of 1-byte or 2-byte characters.
I'm still in favour of dropping the resizing of the result
string, and computing the number of bytes in a first run.
The code becomes clearer that way and more performant; see
attached unicode.diff.
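
For readers without the attachment, the two-pass idea can be sketched
in Python as follows. This is not the contents of unicode.diff; the
function names are hypothetical, the code targets Python 3, and lone
surrogates are not handled. The first pass computes the exact number of
bytes, the second fills a buffer of exactly that size, so no resizing
is needed.

```python
def utf8_length(text):
    """First pass: count the bytes the UTF-8 encoding of text needs."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3
        else:
            total += 4
    return total


def encode_utf8_two_pass(text):
    """Second pass: fill a buffer sized exactly by utf8_length()."""
    out = bytearray(utf8_length(text))
    pos = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out[pos] = cp
            pos += 1
        elif cp < 0x800:
            out[pos] = 0xC0 | (cp >> 6)
            out[pos + 1] = 0x80 | (cp & 0x3F)
            pos += 2
        elif cp < 0x10000:
            out[pos] = 0xE0 | (cp >> 12)
            out[pos + 1] = 0x80 | ((cp >> 6) & 0x3F)
            out[pos + 2] = 0x80 | (cp & 0x3F)
            pos += 3
        else:
            out[pos] = 0xF0 | (cp >> 18)
            out[pos + 1] = 0x80 | ((cp >> 12) & 0x3F)
            out[pos + 2] = 0x80 | ((cp >> 6) & 0x3F)
            out[pos + 3] = 0x80 | (cp & 0x3F)
            pos += 4
    # The buffer must be filled exactly: no spill and no slack.
    assert pos == len(out)
    return bytes(out)


# Usage: one character of each encoded width (1, 2, 3 and 4 bytes).
sample = "a\u00e9\u6b63\U00010000"
result = encode_utf8_two_pass(sample)
```

Whether the extra pass costs more than the occasional resize is the
point under dispute in this thread; the sketch says nothing about
timing.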
----------------------------------------------------------------------