Sorting by item_in_another_list

Cameron Walsh cameron.walsh at gmail.com
Thu Oct 26 22:58:20 EDT 2006


Steven Bethard wrote:
> 
> As you noted, you'll get an error if you try to concatenate B as a set 
> to the list.
> 
> Steve

Whoops, remind me to check things work before I type them.  In the mean 
time, here are some more interesting timing results:

With a larger data set, 500 elements instead of 100, the times were 
almost the same:

python -m timeit -s "A=range(500); B=range(40,60,2); B_set=set(B)" 
"A=B+list(x for x in A if x not in B_set)"
10000 loops, best of 3: 103 usec per loop

python -m timeit -s "A=range(500); B=range(40,60,2); B_set = set(B)" 
"A.sort(key=B_set.__contains__, reverse=True)"
10000 loops, best of 3: 99.2 usec per loop

But for the last recommended one:

python -m timeit -s "A=range(500); B=range(40,60,2); C = set(A) -set(B)" 
"A=B+filter(C.__contains__, A)"
10000 loops, best of 3: 38.3 usec per loop

Much faster!  But that does not take into account the set creation and 
set difference calculations.  Let's try including that:

python -m timeit -s "A=range(500); B=range(40,60,2)" "C=set(A)-set(B)" 
"a=B+filter(C.__contains__, A)"
10000 loops, best of 3: 105 usec per loop

and compare to our other two versions including set creation:

python -m timeit -s  "A=range(500); B=range(40,60,2)" "B_set = set(B); 
A=B+list(x for x in A if X not in B_set"
10000 loops, best of 3: 105 usec per loop

python -m timeit -s  "A=range(500); B=range(40,60,2)" "B_set = set(B); 
A.sort(key=B_set.__contains__, reverse=True)"
10000 loops, best of 3: 101 usec per loop

Pretty similar again.

However, converting each element to a string first ("A=list(str(x) for x 
in range(500)" etc.), making it more similar to my data, gave:

102usec per loop for the A.sort() method
132usec per loop for the A=B+filter() method
104usec per loop for the A=B+list() method

It is interesting to note the increased time taken by the set 
difference/filter method.  It appears not to like strings as much as 
ints.  We can also shave another 7 usec off the filter time:

python -m timeit -s "A=list(str)x) for x in range(500)); B=list(str(x) 
for x in range(40,60,2)); not_in_B_set=lambda x: x not in B_set" 
"B_set=set(B); A=B+filter(not_in_B_set, A)"
10000 loops, best of 3: 125 usec per loop

In conclusion, it appears I should use the clearest of all the methods, 
since for the data sizes I am dealing with, they are of approximately 
the same speed, and the time taken is negligible anyway.

Thanks for all your help,

Cameron.



More information about the Python-list mailing list