unicode() vs. s.decode()

Thorsten Kampe thorsten at thorstenkampe.de
Sat Aug 8 13:00:11 EDT 2009


* Michael Ströder (Sat, 08 Aug 2009 15:09:23 +0200)
> Thorsten Kampe wrote:
> > * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
> >> But why assume that the program takes 8 minutes to run? Perhaps it takes 
> >> 8 seconds to run, and 6 seconds of that is the decoding. Then halving 
> >> that reduces the total runtime from 8 seconds to 5, which is a noticeable 
> >> speed increase to the user, and significant if you then run that program 
> >> tens of thousands of times.
> > 
> > Exactly. That's why it doesn't make sense to benchmark decode()/
> > unicode() in isolation - meaning out of the context of your actual 
> > program.
> 
> Thorsten, the point is that you're too arrogant to admit that making such 
> a general statement as you did, without knowing *anything* about the 
> context, is simply false.

I made a general statement to a very general question ("These both 
expressions are equivalent but which is faster or should be used for any 
reason?"). If you have specific needs or reasons, then you obviously 
failed to provide that specific "context" in your question.
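
For the record, the two spellings in question do the same thing here - a 
minimal sketch (the sample byte string is made up):

    # Python 2: both build the same unicode object from UTF-8 bytes
    s = '\xc3\xa4'               # UTF-8 bytes for u'\xe4' (a-umlaut)
    u1 = unicode(s, 'utf-8')     # the older builtin spelling
    u2 = s.decode('utf-8')       # the str method added in 2.2
    assert u1 == u2 == u'\xe4'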
 
> >> By all means remind people that premature optimization is a 
> >> waste of time, but it's possible to take that attitude too far to Planet 
> >> Bizarro. At the point that you start insisting, and emphasising, that a 
> >> three second time difference is "*exactly*" zero,
> > 
> > Exactly. Because it was not generated in a real-world use case but by 
> > running a simple loop one million times. Why one million times? Because 
> > by running it "only" one hundred thousand times the difference would 
> > have seemed even less relevant.
> 
> I was running it one million times to mitigate influences on the timing 
> by other background processes, which is a common technique when 
> benchmarking.

Err, no. That is what "repeat" is for, and it defaults to 3 ("This means 
that other processes running on the same computer may interfere with the 
timing. The best thing to do when accurate timing is necessary is to 
repeat the timing a few times and use the best time. [...] the default 
of 3 repetitions is probably enough in most cases.")

Three times - not one million times. You choose one million times (for 
the loop) when the thing you're testing is very fast (like decoding) and 
you don't want results in the 0.00000n range. Which is what you asked 
for and what you got.
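
A minimal timeit sketch of what I mean (the statement and setup strings 
are just placeholders):

    import timeit

    # number=1000000 only scales a very fast statement into a
    # measurable range; repeat=3 is what guards against interference
    # from other processes - take the best (minimum) of the repetitions
    t = timeit.Timer("s.decode('utf-8')", setup="s = 'some bytes'")
    print min(t.repeat(repeat=3, number=1000000))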

> > I already gave good advice:
> > 1. don't benchmark
> > 2. don't benchmark until you have an actual performance issue
> > 3. if you benchmark then the whole application and not single commands
> 
> You don't know anything about what I'm doing and what my aim is. So your
> general rules don't apply.

See above. You asked a general question, you got a general answer.
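
And to make point 3 concrete: profile the whole application rather than a 
single statement. A sketch, assuming Python 2.5+ for cProfile, with 
myapp.main as a made-up entry point:

    import cProfile
    import pstats

    # run the real program under the profiler and see where the
    # time actually goes before optimizing anything
    cProfile.run('import myapp; myapp.main()', 'myapp.prof')
    pstats.Stats('myapp.prof').sort_stats('cumulative').print_stats(10)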
 
> > It's really easy: Michael has working code. With that he can easily 
> > write two versions - one that uses decode() and one that uses unicode().
> 
> Yes, I have working code which was originally written before .decode() 
> was added in Python 2.2. Therefore I wondered whether it would be nice 
> for readability to replace unicode() with s.decode(), since the software 
> no longer supports Python versions prior to 2.3 anyway. But performance 
> is also one aspect, hence my question and testing.

You haven't done any testing yet. Running decode/unicode one million 
times in a loop is not testing. If you don't believe me, then at least 
read Martelli's "Optimization" chapter in Python in a Nutshell (the 
chapter is available via Google Books).
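
Testing would look more like timing both variants inside a realistic 
processing loop over real input. A rough sketch with stand-in data:

    import time

    def run(convert, lines):
        # time the conversion embedded in a processing loop,
        # not an isolated statement repeated a million times
        start = time.time()
        total = 0
        for raw in lines:
            total += len(convert(raw))
        return time.time() - start

    lines = ['some byte string\n'] * 100000   # stand-in for real input
    print 'unicode():', run(lambda s: unicode(s, 'utf-8'), lines)
    print '.decode():', run(lambda s: s.decode('utf-8'), lines)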

Thorsten


