[Python-Dev] PEP 393 decode() oddity

Serhiy Storchaka storchaka at gmail.com
Sun Mar 25 18:25:10 CEST 2012


PEP 393 (Flexible String Representation) is, without doubt, one of the 
pearls of Python 3.3. In addition to reducing memory consumption, it 
also often leads to a corresponding increase in speed. In particular, 
string encoding is now 1.5-3 times faster.
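
To illustrate the memory side, here is a small sketch (not from the 
attached benchmarks) showing that per-character storage now depends on 
the widest code point in the string; the exact byte counts vary by 
platform and build:

    import sys

    # PEP 393 picks the narrowest representation that can hold the
    # string: 1, 2 or 4 bytes per character depending on the widest
    # code point present.
    for s in ' ' * 1000, '\u0100' * 1000, '\U00010000' * 1000:
        print(hex(ord(s[0])), sys.getsizeof(s))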

But decoding does not fare so well. Here are the results of measuring 
the performance of decoding 1000-character strings consisting of 
characters from different ranges of Unicode, for three versions of 
Python -- 2.7.3rc2, 3.2.3rc2+ and 3.3.0a1+. Little-endian 32-bit i686 
builds, gcc 4.4. The numbers are decoding times (lower is better).

encoding  string                 2.7   3.2   3.3

ascii     " " * 1000             5.4   5.3   1.2

latin1    " " * 1000             1.8   1.7   1.3
latin1    "\u0080" * 1000        1.7   1.6   1.0

utf-8     " " * 1000             6.7   2.4   2.1
utf-8     "\u0080" * 1000       12.2  11.0  13.0
utf-8     "\u0100" * 1000       12.2  11.1  13.6
utf-8     "\u0800" * 1000       14.7  14.4  17.2
utf-8     "\u8000" * 1000       13.9  13.3  17.1
utf-8     "\U00010000" * 1000   17.3  17.5  21.5

utf-16le  " " * 1000             5.5   2.9   6.5
utf-16le  "\u0080" * 1000        5.5   2.9   7.4
utf-16le  "\u0100" * 1000        5.5   2.9   8.9
utf-16le  "\u0800" * 1000        5.5   2.9   8.9
utf-16le  "\u8000" * 1000        5.5   7.5  21.3
utf-16le  "\U00010000" * 1000    9.6  12.9  30.1

utf-16be  " " * 1000             5.5   3.0   9.0
utf-16be  "\u0080" * 1000        5.5   3.1   9.8
utf-16be  "\u0100" * 1000        5.5   3.1  10.4
utf-16be  "\u0800" * 1000        5.5   3.1  10.4
utf-16be  "\u8000" * 1000        5.5   6.6  21.2
utf-16be  "\U00010000" * 1000    9.6  11.2  28.9

utf-32le  " " * 1000            10.2  10.4  15.1
utf-32le  "\u0080" * 1000       10.0  10.4  16.5
utf-32le  "\u0100" * 1000       10.0  10.4  19.8
utf-32le  "\u0800" * 1000       10.0  10.4  19.8
utf-32le  "\u8000" * 1000       10.1  10.4  19.8
utf-32le  "\U00010000" * 1000   11.7  11.3  20.2

utf-32be  " " * 1000            10.0  11.2  15.0
utf-32be  "\u0080" * 1000       10.1  11.2  16.4
utf-32be  "\u0100" * 1000       10.0  11.2  19.7
utf-32be  "\u0800" * 1000       10.1  11.2  19.7
utf-32be  "\u8000" * 1000       10.1  11.2  19.7
utf-32be  "\U00010000" * 1000   11.7  11.2  20.2

The first oddity is that characters from the second half of the 
Latin-1 table are decoded faster than characters from the first half. I 
think the characters from the first half of the table should be decoded 
at least as quickly.

The second, sadder oddity is that UTF-16 decoding in 3.3 is much slower 
than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This 
is a considerable regression. UTF-32 decoding has also slowed down by 
1.5-2 times.

The fact that UTF-8 decoding has also slowed in some cases is not 
surprising. I believe that on a platform with a 64-bit long there may 
be other oddities.

How serious a problem is this for the Python 3.3 release? I could work 
on the optimization, if someone is not already working on this.
-------------- attachments --------------
bench_decode.py (text/x-python, 806 bytes):
<http://mail.python.org/pipermail/python-dev/attachments/20120325/a599326c/attachment.py>
bench_decode-2.py (text/x-python, 810 bytes):
<http://mail.python.org/pipermail/python-dev/attachments/20120325/a599326c/attachment-0001.py>

