[Patches] [ python-Patches-629637 ] Add a sample selection method to random.py

Fri, 08 Nov 2002 12:05:18 -0800

Patches item #629637, was opened at 2002-10-27 21:03
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=629637&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Raymond Hettinger (rhettinger)
Assigned to: Raymond Hettinger (rhettinger)
Summary: Add a sample selection method to random.py

Initial Comment:
random.randset(n, k) returns a k length list of unique 
integers in the range [0,n).

Improves on a Cookbook submission by using the 
parameters to select between a shuffle algorithm and 
a dictionary algorithm.

I want to add this to the library because it is a simple, 
robust solution to a general selection problem and 
because it isn't obvious that two different algorithms 
are needed to balance speed/space trade-offs.

If approved, will add docs and a news item.

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2002-11-08 15:05

Message:
Logged In: YES 
user_id=31435

I'd rather you went back to the original scheme -- as 
a "speed-freak basic building block", sticking to implicit 
range(n) was clear, and nobody who wants that behavior is 
going to guess that passing xrange(n) might work in the 
new scheme.

If random order is a promise of this method, than that must 
be documented.  As is, the docs are silent about order, so 
any order meets the spec.  If it's important that it be 
random, then the docs have to constrain implementations; 
if it's not important, you can't use it as an argument <wink>.

The return type isn't documented and should be, esp. if you 
want to stick to the new scheme.  That it always returns a 
list will be surprising (if I pass, e.g., a string, I *expect* a 
string of length k to come back; or if a tuple, a tuple of 
length k, etc. -- this became clear from combgen's users, 
and is another reason sticking to the basic building block 
function is better -- we put this in, and next thing is a 
feature request to return a sequence of the same type as 
the input).

Comments about use case subtleties, and algorithm 
obscurities, belong in the docs and in code comments more 
than in patch comments.

You surely don't want to hear this next one <wink>, but the 
patch appears to be missing test cases.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 14:43

Message:
Logged In: YES 
user_id=80475

P.S.  The code continues to use the index list internally. 
This leaves the original pool unmolested and allows the 
use of xrange(n) as an argument.

By not using the population elements as dictionary keys, 
no assumptions need to be made about the uniqueness of 
the population list.  A weighted population is valid:
  sample('red red red blue blue'.split(), 3)

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 14:23

Message:
Logged In: YES 
user_id=80475

As requested, revised patch to accept a population 
sequence instead of an index range.

Now that xrange() is fixed (a separate issue), this patch 
will also serve to choose from large integer sequences 
without building the whole sequence first:   sample(xrange
(10000000), 60).

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-11-08 13:20

Message:
Logged In: YES 
user_id=31435

Guido, you may recall that you used combgen in the 
Mankato project (to generate random, non-overlapping 5(?)-
word "fingerprints" from email msgs).  There are certainly 
valid uses for this stuff, and good algorithms aren't easy.

combgen resolved the range(n) vs sequence "dilemma" by 
providing both, where the former was primarily for speed 
freaks, and the latter was implemented via has-a of the 
former.  Both are useful, and the former is *essential* in 
some cases (e.g., picking 3 out of a billion -- as Raymond 
says, you can't well materialize an explicit list of a billion 
elements first).  So as a basic building block, range(n) is 
more useful.  OTOH, users often don't see how to build 
what they want out of basic blocks.

About random vs sorted, Raymond provided a plausible 
use case.  Nobody brought that up when I was doing 
combgen, but it's another thing different apps may want 
done differently.  Purely from an efficiency view, it's quicker 
not to guarantee ascending order (combgen sorts under 
the covers), so in that way Raymond's range(n) gimmick is 
even more of a speed-freak basic building block than 
combgen's CombGenBasic class.

It's always a puzzle figuring out where things belong.  
combgen didn't start life doing random combinations -- it 
started because merely computing the number of k-
combinations (of n things) *is* a frequent question (how 
many poker hands are there?  bridge hands?), and an 
efficient algorithm for computing that isn't obvious either.  
Start from there, and it's soon apparent that there are 
many algorithms involving combinations, so much so that if 
you're working in this area, a class capturing the concept is 
very useful.

Ideally, Python would have a package for combinatorial 
objects, and modules therein would tackle combinations, 
permutations, partitions, and possibly basic graph 
algorithms.  combgen was meant to be a start at that, but 
it ended there too.

So that's a mild dilemma:  if we put one of these in, a small 
but probably growing user base will want "more of the 
same", and random.py isn't even arguably the right place to 
put any of the rest.

As to how straightforward even this is, I expect this is the 
only patch in Python history to have 10 versions attached 
<wink>.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-11-08 12:06

Message:
Logged In: YES 
user_id=6380

Tim's code is at

http://mail.python.org/pipermail/python-dev/2002-August/028399.html

If you really need the selection in random order, wouldn't
it make more sense to apply shuffle() to the resulting list?
(Applying sort() to the list if you don't want it randomized
seems backwards.)

I do find returing a list of indices less intuitive than a
list of elements.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 09:01

Message:
Logged In: YES 
user_id=80475

FWIW, I did try out the complement selection method for 
k>n/2 but found that it improved performance in some 
cases and worsened it in others.  More importantly, it 
interfered with the goal of returning the selections in 
random order.  Select 10 raffle winners, give a grand prize, 
2 second prizes, 3 third prizes, and 4 fourth prizes -- the 
results must be in random order so that the grand prize is 
not biased by a non-random ordering.

If everyone prefers sample(sequence, k) to sample(n,k), I 
will be happy to change it.

If Tim wants to send me some code to study, that's cool.  I 
always learn something from reading his code.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-11-08 08:19

Message:
Logged In: YES 
user_id=6380

Still, the question remains, why are all these functions so
disconnected in their interface. Why does shuffle() take an
optional random() function as argument? Why doesn't sample()
take a list from which it returns a sample? Why isn't
sample() a generator? Etc. These aren't necessarily good
questions, but without trying to use these functions, I
can't tell. The APIs look pretty random. Maybe the random()
module is destined to be a random collection of useful
statistical hacks? It already looks like that to me now. If
that's the case, I'm not against adding some more, but I
wish that Raymond would look at Tim's code and suggestions
(e.g. complement selection for k > n/2). It does seem to me
that a *random* sample falls in the same category as Tim's
"generate all samples" code though, so arguably Raymond's
sample() would belong in random.py even if CombGen.py were
in the standard library. Also consider that many uses of
random() are inspired by education -- for some reason,
teachers like to teach programming using the random()
function and its derivatives to write simple games (number
guessing), visual effects (brownian motion) and more.
random.sample() might well fit in that category. Another
potential use category could be simple applied statistics,
like Raymond's transaction testing. It seems that such
things fill some kind of need (otherwise there wouldn't be
two cookbook recipes for it).

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 07:37

Message:
Logged In: YES 
user_id=80475

I use the routine for transaction testing in audit work.

The random order is useful so that subslices of the result 
are also valid random samples.  I run a sample of 60, test 
the first 25, if an error is found, the sample expands to 60, 
and if more errors are found, the transaction set does not 
pass the audit.

The cookbook poster also needed the routine in his work 
and wanted it badly enough to make an excrutiating 
tranlation from old Fortran code from a textbook.  To save 
bungled re-inventions of the wheel, I crafted a cleaner 
solution than either my quick and dirty or his translated 
version.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-11-08 03:59

Message:
Logged In: YES 
user_id=31435

Well, you're in murky waters because it's a "new feature" 
patch rather than a bugfix, and wasn't vetted on Python-
Dev or c.l.py or via PEP first, nor is it a function in wide use 
already, neither one that people have asked for in 
the "small feature requests" PEP.  It appeared out of the 
blue, and "unsolicited"/undiscussed new features are 
*usually* hard sells.  The alternative is boundless bloat.

Python went for years without random.shuffle(), and that 
got added because (a) at any given moment, you were 
likely to find a c.l.py discussion about someone's incorrect 
Python code for shuffling; and, (b) how to shuffle was a very 
popular FAQ on the Tutor list.  So the demand, and the 
difficulty of rolling your own, were compellingly clear at the 
time.

In contrast, people asking how to get a random k-
combination are almost conspicuous by absence, which 
makes the "very common use" claim hard to buy when 
viewed against the Python community as a whole..  The 
handful (an exaggeration -- I only wish there were 5 <wink>) 
who egged me into writing CombGen.py at the time wanted 
much more than *just* that, and CombGen tried to meet all 
expressed desires at the time.  I have to agree with Martin 
that people who would use this also want a lot of related 
stuff (I'm one of them).

Some of the design decisions here remain unclear.  Where 
CombGen went out of its way to guarantee that 
combinations are always delivered in "ascending" order, you 
seem to want to guarantee that they appear in a random 
order.  Why?  Especially since you view these as index 
vectors, ascending order gives the best shot at locality of 
reference when the user does the indirect indexing bit.

People who intend to use the result as a random starting 
point into the lexicographic or Gray code ordering of k-
combinations also need ascending order.

CombGen never went into the std library because I never 
made an attempt to put it there:  CombGen never 
attracted a signficant audience, and I'm not keen to push 
things into the library that, as far as I can tell, only a few 
people use.

Since that's the std I hold myself to, it's also the std I'm 
inclined to hold others to.

In the absence of being able to point to potential users from 
c.l.py threads, let me ask why *you* wrote it.  Did you have 
an actual app that needed this function (and if so, what was 
it), or was it more of an interesting programming exercise?

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-11-08 01:52

Message:
Logged In: YES 
user_id=21627

Well, I agree that the patch is correct in the sense of
doing what it says it does. What I cannot judge is whether
the feature is useful; it looks like bloat to me.

I could be convinced if you find a user of this function (or
the Cookbook recipe) who says I use it for this and that,
and I would prefer to see it in the library for that reason,
instead of copying it from the Cookbook.

I have the feeling that anybody who would use such a
function would also use ten other "standard" functions which
are not included in the library at the moment. So that
person would not be helped with getting the single function;
he would need an entirely new library of such things.

So I would propose that you withdraw the patch.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 01:36

Message:
Logged In: YES 
user_id=80475

I'm re-learning to hate the patch process.  This was a 
straight-forward, thoroughy tested, useful patch.  Getting it 
accepted wasn't supposed to be hard.

What is the next step  --  Take it as is, convert the n 
argument to choice() style population list, or withdraw the 
patch?

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-11-07 19:55

Message:
Logged In: YES 
user_id=6380

I'm not even looking at this, I'm delegating this to Tim. He
knows infinitely more about random and permutations than I
do, and he's actually used this stuff.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-07 18:30

Message:
Logged In: YES 
user_id=80475

Assigned to GvR for pronouncement on
a) whether he agrees that a sampling function is useful.
b) whether to implement it as is or with sequence 
arguments
c) whether to leave it in random or put in another module.

The current form returns a list of integers that can be used 
directly or as indices into a sequence.  The advantages are 
flexibility in use and the ability to pick a hundred elements 
out of ten million without building a long list first.  The 
approach is essentially a uniquified list of calls to 
randrange().  Tim prefers an approach that parallels 
random.choice() where the call looks like:
   random.sample([a,b,c,d,e], 2)  # picks 2 of the 5 objects

I think the function belongs in the random module since it 
is a primary use of random numbers (just like shuffle() 
and choose()).  Tim prefers to have a separate library 
module that has a whole grab bag of combinatorics.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 16:38

Message:
Logged In: YES 
user_id=80475

Thanks for the quick follow-ups.

The switchover ratio of six came from counting pointers 
and longs.  Shuffling uses an n length list at one pointer 
for each element.  The dictionary approach has k 
elements with a hash code, a key pointer, and a value 
pointer for a total of three multiplied by 1.5 and rounding 
up to five (because dict loading is kept under 2/3) and one 
pointer for the 'inorder' return list for a total of six.  Also, I 
liked six to minimize resampling in the dictionary 
approach (keeping it under 20%).

As requested, I'll add the random argument to the 
documentation.

Originally, I was going to have sample() select from an 
arbitrary collection (like choose() does) but, in the end, 
preferred the current approach of choosing integers.  This 
approach allows sample(1000000,60) without building a 
giant list first.  Also, converting from indices to elements is 
trivial:   [colorlist[i] for i in random.sample(len
(colorlist),5)].  

I avoided the n/2 complement selection technique 
because of use case rarity and to allow the sample itself to 
be in random order (oxymoron?).  If you guys think it's 
necessary, I'll add a complement selection branch 
followed by a call to random.shuffle().  Still, as it stands, 
the code is robust, uses space no larger than a k sized 
dictionary, and runs with no more than 1.2*k calls to 
random().

I don't know why CombGen.py never made it to 
Tools/scripts. Even if it does, I think a random sampling 
function belongs in the random module where people can 
find it -- it is a very common use of random numbers.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-11-05 13:31

Message:
Logged In: YES 
user_id=31435

I agree this is useful, but would rather see Python grow 
libraries for combinatorial objects.  There are many things 
beyond this that are also useful,  For example, the examples 
you gave here were of selections from collections that aren't 
range(n), and it would be more useful to more people to have 
a way to choose k elements from an arbitrary n-element 
collection directly (like a collection of transactions, or a set of 
cards, whatever).

Note that I posted a module to Python-Dev not long ago that 
implements such stuff (CombGen.py), along with other useful 
functions on combinations.

Note that when k > n/2, "the usual trick" isn't to shuffle a list, 
but to generate a complement selection.  For example, if you 
want a random sample of 9999 out of 10000, it's a lot more 
efficient to pick the single element that's *not* in the result.  
See CombGen for code to do this.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-11-05 12:34

Message:
Logged In: YES 
user_id=21627

Thanks for the explanation. On to the implementation: How
did you arrive at the factor of 6 between a dictionary and a
list?

The documentation should mention the random optional argument.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 10:36

Message:
Logged In: YES 
user_id=80475

Like shuffle() and choose(), random sampling without 
replacement is one of the core principal use cases for 
random numbers.

Acceptance testing often requires a fixed number of non-
overlapping samples i.e.  Selecting 60 transactions out of 
a 1000 and finding zero errors yields a 95% confidence 
that the population has less than a 5% error rate.

Some simulations also need groups of non-overlapping 
samples i.e. a lottery result of six unique numbers 
selected from a range of 1 to 57.  An electronic raffle picks 
consecutive winners without allowing previous winners to 
be  reselected.

While sampling with replacement is trivial to implement 
with a list comprehension, sampling without replacement 
has a number of implementation nuances that makes it 
worthwhile to have a robust solution already implemented 
in the random library.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-11-05 03:27

Message:
Logged In: YES 
user_id=21627

Can you explain why this needs to be in the standard
library? I.e. what typical application would use it?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 01:33

Message:
Logged In: YES 
user_id=80475

Martin, do you have time to give this patch a second 
review?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-31 02:29

Message:
Logged In: YES 
user_id=80475

Added new version with local variable optimization and 
with the dictionary results returned in selection order.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 23:42

Message:
Logged In: YES 
user_id=80475

Renamed to random.sample(n,k) to show that it is used 
for sampling without replacement.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 23:42

Message:
Logged In: YES 
user_id=80475

Renamed to random.sample(n,k) to show that it is used 
for sampling without replacement.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 07:54

Message:
Logged In: YES 
user_id=80475

Added full patch with news item and docs.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=629637&group_id=5470