[PyPy-issue] [issue660] str.decode('utf8', 'replace') -- conformance with Unicode 5.2/6.0

Sun Mar 6 08:44:18 CET 2011

New submission from Ezio Melotti <ezio.melotti at gmail.com>:

The attached patch fixes a corner case in the utf8 decoder with some invalid 3-
or 4-byte sequences when the error handler is "replace".
A more detailed explanation can be found in CPython issue #8271[0], starting
from message 109155[1] (the earlier part of that issue is already fixed in PyPy).
The patch includes extensive tests.
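
As an illustration of the kind of input involved, the byte sequence below
follows the example given in the Unicode Standard's discussion of U+FFFD
substitution. This is only a sketch: it assumes a recent CPython 3, whose
built-in decoder follows that recommendation, and it does not call the patched
pypy.rlib.runicode function.

```python
# Bytes grouped as: 61 | F1 80 80 | E1 80 | C2 | 62
# The three middle groups are truncated multi-byte sequences.
data = b"\x61\xf1\x80\x80\xe1\x80\xc2\x62"

# With the "replace" handler, each truncated sequence is expected to
# become a single U+FFFD rather than one U+FFFD per byte.
decoded = data.decode("utf-8", "replace")
print(ascii(decoded))
```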

Benchmarks with patch:
Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from timeit import Timer
>>> setup_nonascii = 'from pypy.rlib.runicode import str_decode_utf_8; text = open("test.txt").read(); l = len(text)'
>>> setup_ascii = 'from pypy.rlib.runicode import str_decode_utf_8; from string import letters; text = letters*10000; l = len(text)'
>>> Timer('str_decode_utf_8(text, l, "strict")', setup_nonascii).timeit(10)
7.4703819751739502
>>> Timer('str_decode_utf_8(text, l, "ignore")', setup_nonascii).timeit(10)
7.4956531524658203
>>> Timer('str_decode_utf_8(text, l, "replace")', setup_nonascii).timeit(10)
8.0847411155700684
>>> Timer('str_decode_utf_8(text, l, "strict")', setup_ascii).timeit(10)
15.456485033035278
>>> Timer('str_decode_utf_8(text, l, "ignore")', setup_ascii).timeit(10)
14.893633127212524
>>> Timer('str_decode_utf_8(text, l, "replace")', setup_ascii).timeit(10)
15.023200035095215

Benchmarks without patch:
Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from timeit import Timer
>>> setup_nonascii = 'from pypy.rlib.runicode import str_decode_utf_8; text = open("test.txt").read(); l = len(text)'
>>> setup_ascii = 'from pypy.rlib.runicode import str_decode_utf_8; from string import letters; text = letters*10000; l = len(text)'
>>> Timer('str_decode_utf_8(text, l, "strict")', setup_nonascii).timeit(10)
27.644415855407715
>>> Timer('str_decode_utf_8(text, l, "ignore")', setup_nonascii).timeit(10)
28.048332929611206
>>> Timer('str_decode_utf_8(text, l, "replace")', setup_nonascii).timeit(10)
28.484920978546143
>>> Timer('str_decode_utf_8(text, l, "strict")', setup_ascii).timeit(10)
15.727217197418213
>>> Timer('str_decode_utf_8(text, l, "ignore")', setup_ascii).timeit(10)
15.779711008071899
>>> Timer('str_decode_utf_8(text, l, "replace")', setup_ascii).timeit(10)
15.517917156219482
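
To make the comparison easier to read, the ratios can be computed from the
timings reported in the two sessions. The snippet below is only a convenience
sketch: the numbers are copied from the output above (rounded to two decimals)
and the dictionary names are made up for this example.

```python
# Seconds for 10 runs on the non-ASCII text, copied from the sessions above.
with_patch = {"strict": 7.47, "ignore": 7.50, "replace": 8.08}
without_patch = {"strict": 27.64, "ignore": 28.05, "replace": 28.48}

# Ratio of unpatched to patched time for each error handler.
ratios = {}
for handler in ("strict", "ignore", "replace"):
    ratios[handler] = without_patch[handler] / with_patch[handler]
    print("%-8s %.1fx faster with patch" % (handler, ratios[handler]))
```

For the ASCII-only text the two sets of timings are within a few percent of
each other, so no comparable ratio is worth computing there.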

A few comments:
* This is not yet fixed in CPython; I started working on it there but then
figured it would be easier to work on a Python version first;
* The speedup in the patched version is most likely because I removed the use of
pypy.rlib.bitmanipulation.splitter (regular bitwise operations looked simpler
and faster to me, so I used those);
* I moved all the utf-8-related tests into a new class;
* My editor stripped a few trailing spaces here and there that are unrelated to
the patch;
* The patch includes comments, but they are fairly specific. I could write a
more general comment to explain what the decoder and the tests do;
* In the decoder there is some code duplication where the error handler is
called. I can factor out the error message, but that won't make things much
better;
* Even if this specific corner case is covered only by Unicode 6.0.0, the
general algorithm is already described in Unicode 5.2.0, therefore it should be
fixed even if Python doesn't use 6.0.0 yet.
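
For readers unfamiliar with the "regular bitwise operations" mentioned above,
this is roughly what combining the bytes of a sequence looks like. It is a
minimal sketch and not the code from the patch: the helper name decode_3byte is
hypothetical, and it assumes the three bytes have already been validated as a
well-formed 3-byte sequence.

```python
def decode_3byte(b0, b1, b2):
    """Combine a validated 3-byte UTF-8 sequence into a code point.

    Hypothetical helper for illustration only; not taken from the patch.
    """
    # 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# U+20AC EURO SIGN is encoded in UTF-8 as E2 82 AC.
code_point = decode_3byte(0xE2, 0x82, 0xAC)
print(hex(code_point))
```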

[0]: http://bugs.python.org/issue8271
[1]: http://bugs.python.org/issue8271#msg109155

----------
effort: ???
files: issue8271.diff
messages: 2263
nosy: amaury, ezio.melotti, pypy-issue
priority: bug
release: ???
status: unread
title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2/6.0

_______________________________________________________
PyPy development tracker <pypy-dev-issue at codespeak.net>
<https://codespeak.net/issue/pypy-dev/issue660>
_______________________________________________________
-------------- next part --------------
A non-text attachment was scrubbed...
Name: issue8271.diff
Type: text/x-diff
Size: 41871 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/pypy-issue/attachments/20110306/b23e8e3a/attachment.diff>