[Python-Dev] Retrieve an arbitrary element from asetwithoutremoving it

Fri Nov 6 15:17:59 CET 2009

On Thu, Nov 5, 2009 at 5:02 PM, Raymond Hettinger <python at rcn.com> wrote:

> def first(iterable):
>   'Return the first value from a container or iterable.  If empty, raises
> LookupError'
>   for value in iterable:
>       return value
>   raise LookupError('no first value; iterable is empty')
>

(This email should not be construed to mean that I support adding a .pick
method; I do not.)

I realized this morning that the "for i in s: break" idiom can be O(n), even
assuming no hash collisions.  Observe this subtly O(n**2) algorithm:

Cashew:~$ cat > test.py
while s:
    for i in s:
        break
    # Imagine a complex algorithm here that depends on i still being in s
    s.remove(i)

Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(20000))'
"`cat test.py`"
10 loops, best of 3: 31.2 msec per loop
Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(40000))'
"`cat test.py`"
10 loops, best of 3: 124 msec per loop
Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(80000))'
"`cat test.py`"
10 loops, best of 3: 491 msec per loop
Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(160000))'
"`cat test.py`"
10 loops, best of 3: 1.96 sec per loop

Doubling n quadruples the run-time, i.e., O(n**2) behavior.

I wonder if those clamoring for a pick() method and complaining about
efficiency are inadvertently running into this?

The problem arises because we're systematically unbalancing the hash table.
The iterator returns the first valid entry in the hash table, which we
remove.  Repeat several times and pretty soon the iterator has to spend O(n)
time scanning through empty entries to find the first remaining valid entry.

On the other hand, the pop/add idiom is O(1), since the .pop implementation
contains some voodoo to remember where it left off.  Consequently, the
following algorithm is O(n) instead of O(n**2):

Cashew:~$ cat > test.py
while s:
    i = s.pop()
    s.add(i)
    # Imagine a complex algorithm here that depends on i still being in s
    s.remove(i)

Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(400000))'
"`cat test.py`"
10 loops, best of 3: 28 msec per loop
Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(800000))'
"`cat test.py`"
10 loops, best of 3: 56.2 msec per loop
Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(1600000))'
"`cat test.py`"
10 loops, best of 3: 113 msec per loop
Cashew:~$ python2.5 /usr/lib/python2.5/timeit.py -s 's=set(range(3200000))'
"`cat test.py`"
10 loops, best of 3: 227 msec per loop

Doubling n doubles the run-time, i.e., O(n) behavior.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20091106/0a03f761/attachment.htm>