[Python-Dev] Unicode charmap decoders slow

Wed Oct 5 17:08:04 CEST 2005

Martin v. Löwis wrote:

> Tony Nelson wrote:
> 
>>> For decoding it should be sufficient to use a unicode string of
>>> length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>> string might be shorter and byte values greater than the length of
>>> the string are treated as "maps to undefined" too.
>>
>> With Unicode using more than 64K codepoints now, it might be more forward
>> looking to use a table of 256 32-bit values, with no need for tricky
>> values.
> 
> You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
> which would indicate that the byte with that index is really unused
> in that encoding.

OK, here's a patch that implements this enhancement to 
PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939

The mapping argument to PyUnicode_DecodeCharmap() can be a unicode 
string and is used as a decoding table.

Speed looks like this:

python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 538 usec per loop
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 3.85 msec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 539 usec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
1000 loops, best of 3: 623 usec per loop

Creating the decoding_map as a string should probably be done by 
gencodec.py directly. This way the first import of the codec would be 
faster too.

Bye,
    Walter Dörwald