unicode speed

Tue Nov 29 15:50:19 EST 2005

In article <pan.2005.11.29.08.48.15.951250 at email.cz>,
 David Siroky <dsiroky at email.cz> wrote:

> Hi!
> 
> I would like to understand Python's unicode speed and implementation.
> 
> My platform is AMD Athlon at 1300 (x86-32), Debian, Python 2.4.
> 
> First a simple example (and time results):
> 
> x = "a"*50000000
> real    0m0.195s
> user    0m0.144s
> sys     0m0.046s
> 
> x = u"a"*50000000
> real    0m2.477s
> user    0m2.119s
> sys     0m0.225s
> 
> So my first question is: why does creating a unicode string take more than
> 10x longer than creating a non-unicode string?

Your first example uses about 50 MB.  Your second uses about 200 MB (or 
100 MB if your Python is a "narrow" build with 2-byte Unicode chars), so 
there is several times as much memory to allocate and fill.  Check the 
size of Unicode chars by:

>>> import sys
>>> hex(sys.maxunicode)

If it says '0x10ffff' each unichar uses 4 bytes; if it says '0xffff', 
each unichar uses 2 bytes.
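
To put numbers on that, here is a rough sketch of the arithmetic.  The 
function names are my own, made up for illustration, and the result 
counts only the character data, not the object header.  It describes 
Python 2 builds like the one discussed here; as far as I know, Python 
3.3 and later always report 0x10ffff and choose the storage width per 
string, so there it is only an upper bound.

```python
import sys

def bytes_per_unichar(maxunicode=sys.maxunicode):
    # Hypothetical helper: 0x10ffff means a wide build (4 bytes per
    # unichar), 0xffff means a narrow build (2 bytes per unichar).
    if maxunicode > 0xffff:
        return 4
    return 2

def unicode_payload_bytes(length, maxunicode=sys.maxunicode):
    # Approximate size of the character data of a unicode string
    # holding `length` unichars.
    return length * bytes_per_unichar(maxunicode)
```

For u"a"*50000000 this gives 200000000 bytes on a wide build and 
100000000 on a narrow one, against 50000000 for the plain string.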

> Another situation: speed problem with long strings
> 
> I have a simple function for removing diacritics from a string:
> 
> #!/usr/bin/python2.4
> # -*- coding: UTF-8 -*-
> 
> import unicodedata
> 
> def no_diacritics(line):
>     if type(line) != unicode:
>         line = unicode(line, 'utf-8')
> 
>     line = unicodedata.normalize('NFKD', line)
> 
>     output = ''
>     for c in line:
>         if not unicodedata.combining(c):
>             output += c
>     return output
> 
> Now the calling sequence (and time results):
> 
> for i in xrange(1):
>     x = u"a"*50000
>     y = no_diacritics(x)
> 
> real    0m17.021s
> user    0m11.139s
> sys     0m5.116s
> 
> for i in xrange(5):
>     x = u"a"*10000
>     y = no_diacritics(x)
> 
> real    0m0.548s
> user    0m0.502s
> sys     0m0.004s
> 
> In both cases the total amount of data is equal but when I use shorter strings
> it is much faster. Maybe it has nothing to do with Python unicode but I would
> like to know the reason.

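It has to do with how strings (either kind) are implemented.  Strings 
are "immutable", so string concatenation is done by making a new string 
that has the concatenated value, and assigning it to the left-hand side.  
Each "output += c" can therefore copy everything built so far, so the 
total work grows roughly with the square of the string's length; that 
is why one 50000-char string costs far more than five 10000-char ones.  
Often, it is faster (but more memory intensive) to append to a list and 
then at the end do a u''.join(mylist).  See GvR's essay on optimization 
at <http://www.python.org/doc/essays/list2str.html>.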
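
As a sketch of the list-and-join approach applied to your function 
(untimed on your machine, so treat the speedup as an expectation, not a 
measurement): this version is written to run on Python 2.6+ and 3.3+; 
on 2.4 I would expect it to work with "bytes" replaced by "str".

```python
import unicodedata

def no_diacritics(line):
    # Accept UTF-8 encoded byte strings as well as unicode strings,
    # as the original function did.
    if isinstance(line, bytes):
        line = line.decode('utf-8')

    line = unicodedata.normalize('NFKD', line)

    # Collect the kept characters in a list, then join once at the
    # end instead of building a new string for every character.
    kept = []
    for c in line:
        if not unicodedata.combining(c):
            kept.append(c)
    return u''.join(kept)
```

The list costs extra memory while it is being built, but the result is 
copied once, so the running time should grow roughly in proportion to 
the length of the input.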

Alternatively, you could use array.array from the Python Library (it's 
easy) to get something "just as good as" mutable strings.
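
A minimal sketch of the array.array variant, assuming the input is 
already a unicode string; the function name is my own.  Note that the 
'u' typecode is marked deprecated in recent Python 3 releases, so this 
fits the Python 2 of this thread better than new code.

```python
import unicodedata
from array import array

def no_diacritics_array(line):
    line = unicodedata.normalize('NFKD', line)

    # array('u') is a mutable buffer of Unicode characters, so
    # append() grows it in place instead of building a new string.
    buf = array('u')
    for c in line:
        if not unicodedata.combining(c):
            buf.append(c)
    return buf.tounicode()
```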
________________________________________________________________________
TonyN.:'                        *firstname*nlsnews at georgea*lastname*.com
      '                                  <http://www.georgeanelson.com/>