unicode() vs. s.decode()

Thu Aug 6 12:26:09 EDT 2009

Thorsten Kampe wrote:
> * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
>> Both of these expressions are equivalent, but which one is faster, or is
>> there any other reason to prefer one of them?
>>
>> u = unicode(s,'utf-8')
>>
>> u = s.decode('utf-8') # looks nicer
> 
> "decode" was added in Python 2.2 for the sake of symmetry with encode().

Yes, and I like the style. But...

> It's essentially the same as unicode() and I wouldn't be surprised if it 
> is exactly the same.

Did you try?

> I don't think there will be any measurable speed difference between
> those two.

Well, that seems not to be true. Try it yourself. I did (my console uses UTF-8 as its charset):

Python 2.6 (r26:66714, Feb  3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(1000000)
7.2721178531646729
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000000)
7.1302499771118164
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
8.3726329803466797
>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000000)
1.8622009754180908
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(1000000)
8.651669979095459
>>>

Comparing the two best combinations again, this time with ten times as many iterations:

>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
17.23644495010376
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
72.087096929550171

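That is significant: roughly a factor of four. Note that the spelling of the
encoding name matters here. Only unicode() with 'utf-8' is fast; unicode() with
'utf8' was slightly slower than decode() in the runs above. So the winner is: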

unicode('äöüÄÖÜß','utf-8')
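
If you want to repeat the measurement on your own machine, something like the
following might do. It is only a rough sketch, not the exact session above: the
helper names best_time() and compare() are made up for illustration, the sample
string is written as escaped UTF-8 bytes so the result does not depend on the
console charset, and it takes the best of several runs to reduce noise. On
Python 3, where unicode() no longer exists, it falls back to str(s, encoding)
as the closest equivalent, which is an assumption on my part and may well
behave differently.

```python
import sys
import timeit

# The seven sample characters, pre-encoded as UTF-8 bytes (b'' needs Python 2.6+).
SETUP = r"s = b'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'"

# unicode() only exists on Python 2; str(bytes, encoding) is used on Python 3.
CTOR = "unicode" if sys.version_info[0] == 2 else "str"


def best_time(stmt, number=100000, repeat=5):
    """Return the fastest of `repeat` runs, each executing `stmt` `number` times."""
    return min(timeit.repeat(stmt, setup=SETUP, number=number, repeat=repeat))


def compare(number=100000, repeat=5):
    """Time both calls with both spellings; return {statement: seconds}."""
    results = {}
    for enc in ("utf-8", "utf8"):
        for stmt in ("s.decode('%s')" % enc, "%s(s, '%s')" % (CTOR, enc)):
            results[stmt] = best_time(stmt, number, repeat)
    return results


if __name__ == "__main__":
    # Print the fastest variant first.
    for stmt, seconds in sorted(compare().items(), key=lambda kv: kv[1]):
        print("%-24s %.4f s" % (stmt, seconds))
```

The absolute numbers will differ between interpreter versions and machines, so
only the ordering within one run is worth comparing.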

Ciao, Michael.