[Python-Dev] Unicode charmap decoders slow
Tony Nelson
tonynelson at georgeanelson.com
Tue Oct 4 03:11:29 CEST 2005
Is there a faster way to transcode from 8-bit chars (charmaps) to utf-8
than going through unicode()?
I'm writing a small card-file program. As a test, I use a 53 MB MBox file,
in mac-roman encoding. My program reads and parses the file into messages
in about 3 to 5 seconds (Wow! Go Python!), but takes about 14 seconds to
iterate over the cards and convert them to utf-8:
    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)
        cards[i] = u.encode('utf-8')
The time is nearly all in the unicode() call. It's not so much how much
time it takes, but that it takes 4 times as long as the real work, just to
do table lookups.
Looking at the source (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a
dictionary lookup for each character. I would have thought that it would
make and cache a LUT the size of the charmap (and hook the relevant
dictionary stuff to delete the cached LUT if the dictionary is changed).
(You may consider this a request for enhancement. ;)
I thought of using U"".translate(), but the unicode version is defined to
be slow, and anyway I can't find any way to just shove my 8-bit data into a
unicode string without translation. Is there some similar approach? I'm
almost (but not quite) ready to try it in Pyrex.
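One candidate for a "similar approach", sketched under assumptions: the latin-1 codec maps byte N straight to code point N, so it can serve as the untranslated shove into a unicode string, and translate() can then apply the real charmap. The sketch is written for present-day Python 3; make_table and transcode_to_utf8 are hypothetical names, and no claim is made here that this is faster than decoding directly.

```python
# Sketch only: decode as latin-1 (byte N becomes code point N),
# then remap every code point through a table built from the real encoding.


def make_table(encoding):
    """Map code point N to the character that byte N represents in `encoding`.

    Bytes the charmap leaves undefined come out as U+FFFD ('replace' handler).
    """
    return {b: bytes([b]).decode(encoding, 'replace') for b in range(256)}


def transcode_to_utf8(data, table):
    """Convert 8-bit charmap data to UTF-8 using a table from make_table()."""
    return data.decode('latin-1').translate(table).encode('utf-8')
```

The table would be built once, for example table = make_table('mac-roman'), and reused for every card.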
I'm new to Python. Googling turned up nothing relevant on python.org or in
the groups. I posted this in comp.lang.python yesterday, got a couple of
responses, but I think this may be too sophisticated a question for that
group.
I'm not a member of this list, so please copy me on replies so I don't have
to hunt them down in the archive.
____________________________________________________________________
TonyN.:' <mailto:tonynelson at georgeanelson.com>
' <http://www.georgeanelson.com/>