unicode speed
David Siroky
dsiroky at email.cz
Wed Nov 30 04:23:00 EST 2005
On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:
> David Siroky:
>
>> output = ''
>
> I suspect you really want "output = u''" here.
>
>> for c in line:
>>     if not unicodedata.combining(c):
>>         output += c
>
> This is creating as many as 50000 new string objects of increasing
> size. To build large strings, some common faster techniques are to
> either create a list of characters and then use join on the list or use
> a cStringIO to accumulate the characters.
That is the answer I wanted, now I'm finally enlightened! :-)
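
For the archive, a minimal sketch of the two techniques Neil mentions. It assumes Python 3, where io.StringIO is the closest counterpart to cStringIO; the function names and the "pieces" argument are made up for illustration only.

```python
import io


def build_with_join(pieces):
    # Collect the pieces in a list, then join them once at the end.
    parts = []
    for piece in pieces:
        parts.append(piece)
    return "".join(parts)


def build_with_stringio(pieces):
    # Write each piece into an in-memory text buffer (Python 3 stand-in
    # for cStringIO), then read the whole result back once.
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()
```

Both return the same string; neither rebuilds the accumulated result on every iteration the way a loop of "output += c" can.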
>
> This is about 10 times faster for me:
>
> def no_diacritics(line):
>     if type(line) != unicode:
>         line = unicode(line, 'utf-8')
>
>     line = unicodedata.normalize('NFKD', line)
>
>     output = []
>     for c in line:
>         if not unicodedata.combining(c):
>             output.append(c)
>     return u''.join(output)
>
> Neil
Thanks!
David
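
A rough Python 3 sketch of the same function, in case it helps later readers. This is an adaptation, not Neil's code: it assumes str is already Unicode (so the unicode() call becomes a bytes check) and swaps the explicit list for a generator passed to join.

```python
import unicodedata


def no_diacritics(line):
    # Accept UTF-8 encoded bytes as well as str, like the quoted function.
    if isinstance(line, bytes):
        line = line.decode("utf-8")

    # NFKD splits accented characters into a base character followed by
    # combining marks; dropping the marks leaves the bare letters.
    decomposed = unicodedata.normalize("NFKD", line)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(no_diacritics("Příliš žluťoučký kůň"))  # prints: Prilis zlutoucky kun
```

Characters with no canonical or compatibility decomposition pass through unchanged, so this only strips marks that NFKD can separate out.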