unicode speed

Wed Nov 30 04:23:00 EST 2005

On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:

> David Siroky:
> 
>>     output = ''
> 
>     I suspect you really want "output = u''" here.
> 
>>     for c in line:
>>         if not unicodedata.combining(c):
>>             output += c
> 
>     This is creating as many as 50000 new string objects of increasing 
> size. To build large strings, some common faster techniques are to 
> either create a list of characters and then use join on the list or use 
> a cStringIO to accumulate the characters.

That is the answer I wanted; now I'm finally enlightened! :-)
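
For the second technique Neil mentions, accumulating into a string buffer, a minimal sketch might look like the following. It is written for Python 3, where io.StringIO takes the place of cStringIO and every str is already Unicode. The function name no_diacritics_buffered is made up for illustration, and UTF-8 input is an assumption.

```python
import io
import unicodedata

def no_diacritics_buffered(line):
    # Decode byte input first; UTF-8 is an assumption about the data.
    if isinstance(line, bytes):
        line = line.decode('utf-8')

    # NFKD splits each accented character into base + combining marks.
    line = unicodedata.normalize('NFKD', line)

    # Write the kept characters into an in-memory text buffer.
    buf = io.StringIO()
    for c in line:
        # combining() returns 0 for characters that are not combining marks.
        if not unicodedata.combining(c):
            buf.write(c)
    return buf.getvalue()
```

Like the list-and-join approach, this avoids rebuilding the result string on every iteration.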

> 
>     This is about 10 times faster for me:
> 
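> import unicodedata
> 
> def no_diacritics(line):
>     # Decode byte strings first; the input is assumed to be UTF-8.
>     if not isinstance(line, unicode):
>         line = unicode(line, 'utf-8')
> 
>     # NFKD splits each accented character into base + combining marks.
>     line = unicodedata.normalize('NFKD', line)
> 
>     # Collect the kept characters in a list and join once at the end.
>     output = []
>     for c in line:
>         if not unicodedata.combining(c):
>             output.append(c)
>     return u''.join(output)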
> 
>     Neil
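
To check the speed difference on a given machine, a rough timing sketch along these lines could be used. It is written for Python 3. The names strip_concat and strip_join and the sample text are invented for illustration, and the numbers depend on the interpreter and version, so the factor of ten may not reproduce.

```python
import timeit
import unicodedata

def strip_concat(line):
    # Repeated += on a string, as in the original loop.
    output = ''
    for c in unicodedata.normalize('NFKD', line):
        if not unicodedata.combining(c):
            output += c
    return output

def strip_join(line):
    # Collect characters in a list and join once at the end.
    output = []
    for c in unicodedata.normalize('NFKD', line):
        if not unicodedata.combining(c):
            output.append(c)
    return ''.join(output)

# Hypothetical test input of roughly 50000 characters.
sample = 'Příliš žluťoučký kůň ' * 2500

if __name__ == '__main__':
    for func in (strip_concat, strip_join):
        # Time 10 runs of each variant on the same input.
        seconds = timeit.timeit(lambda: func(sample), number=10)
        print(func.__name__, seconds)
```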

Thanks!

David