selecting a random item from a set

Tim Peters tim.one at comcast.net
Wed Dec 31 02:43:15 EST 2003


[Tim, asks which algorithms require random selection, as opposed
 to arbitrary selection, from a set]

[Alexander Schmolck]
> Feature selection schemes, for example?

Eh?  That's like "numerical methods" <wink/frown>.  IOW, too vague.

>> (it can be done using O(1) space instead,

> s/zip/izip/ s/range/xrange/ ?

Yes, but that's still got a ton of overhead due to the Python-level looping.
You can do it in O(1) space now at essentially full C speed via, e.g.,

    i = random.randrange(len(some_set))
    random_elt = itertools.islice(some_set, i, None).next()
    some_set.remove(random_elt)

That tricks it into finding the starting point with a C loop, and then the
iterator is discarded after extracting its first element.

But (of course) that still takes average time proportional to the # of
elements in the set.  Well, that wishes away a detail:  this can actually be
arbitrarily expensive, as there's no bound on how sparse a Python dict can
get, and the C dict iteration code has to visit every slot, occupied or not,
to find "the next" occupied slot.  So, e.g., it's possible that selecting
the only element in a 1-element set requires millions of iterations at the C
level.  That's unlikely, just possible (a quirk of CPython's dict
implementation is that the implementation data structure never shrinks when
you delete a key; it may shrink when you add a new key, though!).

> ...
> I'm not entirely bitten by the premature optimization bug however --
> I know that most of the time in an existing implementation of what I
> want to do in another language is spent doing these set operations
> and takes minutes to hours to come up with its outputs.

Which other language; how does it implement sets; is the time in that
language consumed by picking random elements from sets, or by other set
operations ("these" isn't really clear); and how big are these sets?

If the sets are small, you're not going to beat random.choice(tuple(s)).  If
the sets are large, low-level optimization of an approach with the wrong O()
behavior is just a monumentally bad idea.  If the sets are of middling size,
get more ambitious <wink>.






More information about the Python-list mailing list