unicode speed

David Siroky dsiroky at email.cz
Tue Nov 29 03:48:15 EST 2005


Hi!

I need to enlighten myself about Python's unicode speed and implementation.

My platform is an AMD Athlon at 1300 MHz (x86-32), Debian, Python 2.4.

First, a simple example (and its time results):

x = "a"*50000000
real    0m0.195s
user    0m0.144s
sys     0m0.046s

x = u"a"*50000000
real    0m2.477s
user    0m2.119s
sys     0m0.225s

So my first question is: why does creating a unicode string take more than
10x longer than creating a non-unicode string?
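(For anyone who wants to reproduce this, a small timeit harness should give
the same picture; this is just my sketch, the numbers above came from the
shell's time. Note the 50000000-character strings need a few hundred MB of
RAM, so shrink the size if necessary.)

import timeit

# Compare allocating a byte string vs. a unicode string of the
# same length; number=1 because a single run already takes seconds.
for stmt in ('x = "a" * 50000000', 'x = u"a" * 50000000'):
    t = timeit.Timer(stmt)
    print stmt, '->', min(t.repeat(repeat=3, number=1)), 's'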

Another situation: a speed problem with long strings.

I have a simple function for removing diacritics from a string:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-

import unicodedata

def no_diacritics(line):
    # Accept byte strings too; decode them to unicode first.
    if not isinstance(line, unicode):
        line = unicode(line, 'utf-8')

    # NFKD splits each accented character into its base character
    # followed by combining mark(s).
    line = unicodedata.normalize('NFKD', line)

    # Copy everything except the combining marks.
    output = u''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output
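As a quick sanity check of what the function does (a made-up example; the
non-ASCII literal works because of the coding declaration above):

print no_diacritics(u'Žluťoučký kůň')   # prints: Zlutoucky kun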

Now the calling sequence (and time results):

for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)

real    0m17.021s
user    0m11.139s
sys     0m5.116s

for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)

real    0m0.548s
user    0m0.502s
sys     0m0.004s

In both cases the total amount of data processed is the same, but with
shorter strings it is much faster. Maybe this has nothing to do with
Python's unicode handling, but I would like to know the reason.
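One thing I may try is timing the += loop against a list-and-join version,
to see whether the repeated concatenation (rather than unicodedata itself)
is what blows up with string length. A rough sketch:

import timeit

setup = 'x = u"a" * 50000'

# Build the output with repeated += (as no_diacritics does now)
concat = '''
output = u""
for c in x:
    output += c
'''

# Collect the characters in a list and join once at the end
join = '''
parts = []
for c in x:
    parts.append(c)
output = u"".join(parts)
'''

for name, stmt in (('concat', concat), ('join', join)):
    t = timeit.Timer(stmt, setup)
    print name, min(t.repeat(repeat=3, number=1))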

Thanks for any notes!

David



