Set & Frozenset?

Lie Ryan lie.1296 at gmail.com
Wed Mar 11 10:53:40 EDT 2009


R. David Murray wrote:
> Lie Ryan <lie.1296 at gmail.com> wrote:
>> Matt Nordhoff wrote:
>>> Alan G Isaac wrote:
>>>>> Hans Larsen schrieb:
>>>>>>             How could I "take" an element from a set or a frozenset 
>>>> On 3/8/2009 2:06 PM Diez B. Roggisch apparently wrote:
>>>>> You iterate over them. If you only want one value, use
>>>>> iter(the_set).next()
>>>> I recall a claim that
>>>>
>>>>     for result in myset: break
>>>>
>>>> is the most efficient way to get one result.
>>>> Is this right? (It seems nearly the same.)
>>>>
>>>> Alan Isaac
>>> Checking Python 2.5 on Linux, your solution is much faster, but seeing
>>> as they both come in under a microsecond, it hardly matters.
>>
>> It's unexpected...
>>
>>  >>> timeit.timeit('res=iter(myset).next()', 'myset=range(1000000)')
>> 0.88944123999999647
>>  >>> timeit.timeit('res=myset.next()', 'myset=range(1000000); myset=iter(myset)')
>> 0.49165520000002516
>>  >>> timeit.timeit('for res in myset: break', 'myset=range(1000000)')
>> 0.32933007999997699
>>
>> I'd never expect that the for-loop assignment is even faster than a 
>> pre-created iter object (the second test)... but this for-loop variable 
>> leaking behavior isn't guaranteed, is it?
> 
> My guess would be that what's controlling the timing here is
> name lookup.  Three in the first example, two in the second,
> and one in the third.

You got it:

 >>> timeit.timeit('res=myset()', 'myset=range(1000000); myset=iter(myset).next')
0.26465903999999796
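The lookup counts David describes can be checked directly in the bytecode. Here is a sketch written for Python 3 (the thread itself uses Python 2; opcode names vary a little between 3.x versions, hence the prefix matching):

```python
# Count the name/attribute lookups executed by each timed expression.
# LOAD_NAME/LOAD_GLOBAL fetch a variable; LOAD_ATTR/LOAD_METHOD fetch
# an attribute -- each one is paid on every pass through timeit's loop.
import dis

def count_lookups(expr):
    instructions = dis.get_instructions(compile(expr, "<expr>", "eval"))
    prefixes = ("LOAD_NAME", "LOAD_GLOBAL", "LOAD_ATTR", "LOAD_METHOD")
    return sum(ins.opname.startswith(prefixes) for ins in instructions)

print(count_lookups("iter(myset).next()"))  # 3: iter, myset, .next
print(count_lookups("myset.next()"))        # 2: myset, .next
print(count_lookups("myset()"))             # 1: myset
```

The 3/2/1 counts match the three/two/one name lookups guessed above.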


----------------------

The following is a complete benchmark:

 >>> timeit.timeit('res=iter(myset).next()', 'myset=range(10000000)', number=10000000)
8.5145002000000432

 >>> timeit.timeit('res=myset.next()', 'myset=range(10000000); myset=iter(myset)', number=10000000)
4.5509802800000898

 >>> timeit.timeit('for res in myset: break', 'myset=range(10000000)', number=10000000)
2.9994213600000421

 >>> timeit.timeit('res=myset()', 'myset=range(10000000); myset=iter(myset).next', number=10000000)
2.2228832400001011
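For readers on Python 3, the same four variants can be reproduced like this. It is only a sketch: iterators expose __next__() instead of .next(), the repeat count and set size here (10**5, my choice, not the original 10**7) are much smaller, and the absolute numbers will differ from the Python 2 figures above, though the ordering should be similar:

```python
# Python 3 re-run of the four variants. The iterator-consuming variants
# use an N-element set so exactly N calls never raise StopIteration.
import timeit

N = 100_000  # repeats; far fewer than the original 10**7
setup_set  = f"myset = set(range({N}))"
setup_iter = f"myset = iter(set(range({N})))"
setup_meth = f"myset = iter(set(range({N}))).__next__"

timings = {
    "next(iter(myset))":       timeit.timeit("res = next(iter(myset))", setup_set, number=N),
    "myset.__next__()":        timeit.timeit("res = myset.__next__()", setup_iter, number=N),
    "for res in myset: break": timeit.timeit("for res in myset: break", setup_set, number=N),
    "myset()":                 timeit.timeit("res = myset()", setup_meth, number=N),
}
for name, t in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {t:.4f}s")
```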

----------------------
I also performed additional timings for the overheads:

Local name lookup:
 >>> timeit.timeit('myset', 'myset=range(1000000)', number=10000000)
1.1086400799999865

Global name lookup:
 >>> timeit.timeit('iter', number=10000000)
1.8149410799999259

Attribute lookup:
 >>> timeit.timeit('myset.next', 'myset=range(10000000); myset=iter(myset)', number=10000000)
3.3011333999997987

Combined name lookups that slowed the first test:
 >>> timeit.timeit('iter(myset).next', 'myset=range(10000000)', number=10000000)
6.5599374800000305

Creating iterators:
 >>> timeit.timeit('iter(myset)', 'myset=range(1000000)', number=10000000)
4.259406719999788

----------------------
So, adjusting for the overheads:

Attribute lookup:
 >>> timeit.timeit('myset.next', 'myset=range(10000000); myset=iter(myset)', number=10000000)
3.3011333999997987
The timing for the attribute lookup also includes a local name lookup (myset), so the real attribute lookup time should be:
3.3011333999997987 - 1.1086400799999865 = 2.1924933199998122

Creating iterators:
 >>> timeit.timeit('iter(myset)', 'myset=range(1000000)', number=10000000)
4.259406719999788
Creating an iterator involves a global name lookup (iter), so the real time should be:
4.259406719999788 - 1.8149410799999259 = 2.4444656399998621

----------------------
To summarize the adjusted overheads:

Local name lookup: 1.1086400799999865
Global name lookup: 1.8149410799999259
Attribute lookup: 2.1924933199998122
Creating iterators: 2.4444656399998621
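The overhead subtractions are plain arithmetic; restated as a short script (the raw figures are copied from the measurements earlier in this post) so the adjusted numbers can be checked:

```python
# Raw timings from the benchmark above (Python 2.5, number=10**7).
local_lookup  = 1.1086400799999865   # 'myset'
global_lookup = 1.8149410799999259   # 'iter'
raw_attr      = 3.3011333999997987   # 'myset.next'
raw_iter      = 4.259406719999788    # 'iter(myset)'

# Each raw figure still contains the lookup needed to reach the object,
# so subtract that lookup to isolate the operation itself.
attr_lookup = raw_attr - local_lookup    # attribute lookup alone
make_iter   = raw_iter - global_lookup   # iterator creation alone
print(attr_lookup, make_iter)
```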

----------------------
Back to the problem; now we adjust the timing of each snippet:
'res=iter(myset).next()': 8.5145002000000432
Adjusting with the "combined name lookups" figure:
8.5145002000000432 - 6.5599374800000305 = 1.9545627200000126
Another way to do the adjustment:
Adjusting global name lookup (iter):
8.5145002000000432 - 1.8149410799999259 = 6.6995591200001172
Adjusting iterable creation:
6.6995591200001172 - 2.4444656399998621 = 4.2550934800002551
Adjusting attribute lookup:
4.2550934800002551 - 2.1924933199998122 = 2.0626001600004429

'res=myset.next()': 4.5509802800000898
Adjusting with the *unadjusted* attribute lookup:
4.5509802800000898 - 3.3011333999997987 = 1.2498468800002911
Another way to do the adjustment:
Adjusting with local name lookup:
4.5509802800000898 - 1.1086400799999865 = 3.4423402000001033
Adjusting with attribute lookup:
3.4423402000001033 - 2.1924933199998122 = 1.2498468800002911

'for res in myset: break': 2.9994213600000421
Adjusting for local name lookup (myset):
2.9994213600000421 - 1.1086400799999865 = 1.8907812800000556

'res=myset()': 2.2228832400001011
Adjusting for local name lookup:
2.2228832400001011 - 1.1086400799999865 = 1.1142431600001146

----------------------

To summarize:
'res=iter(myset).next()': 1.9545627200000126 / 2.0626001600004429
'res=myset.next()': 1.2498468800002911 / 1.2498468800002911
'for res in myset: break': 1.8907812800000556
'res=myset()': 1.1142431600001146

----------------------

To conclude, 'for res in myset: break' is actually not much faster than 
'res=iter(myset).next()'; the former merely saves several name lookups. 
The problem with 'res=iter(myset).next()' is the extra name lookups plus 
the cost of creating the iterator object.

The fastest method is 'res=myset()', which pre-binds the bound method 
and so pays only a single local name lookup per call; it remains the 
fastest of the four even after all the overheads are subtracted.
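A closing note for readers on Python 3 (beyond what the thread above measured, so treat it as a sketch): the iterator method is spelled __next__, and the idiomatic way to take one element is the next() built-in, which also accepts a default for the empty-set case:

```python
# Idiomatic Python 3 ways to take one arbitrary element from a set
# without removing it. next() raises StopIteration on an empty set
# unless a default is supplied.
myset = {10, 20, 30}
res = next(iter(myset))              # one arbitrary element
assert res in myset

empty = set()
fallback = next(iter(empty), None)   # default avoids StopIteration
print(res, fallback)
```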

DISCLAIMER: I cannot guarantee there aren't any mistakes.

PS: The results of the benchmark must be taken with a grain of salt. The 
differences only show up over 10000000 (10**7) iterations, which means a 
one-second difference here is only about a 10**-7 second difference per 
operation in reality.


