[Patches] [ python-Patches-629637 ] Add a sample selection method to random.py
noreply@sourceforge.net
noreply@sourceforge.net
Fri, 08 Nov 2002 04:37:46 -0800
Patches item #629637, was opened at 2002-10-27 21:03
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=629637&group_id=5470
Category: Library (Lib)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Raymond Hettinger (rhettinger)
Assigned to: Tim Peters (tim_one)
Summary: Add a sample selection method to random.py
Initial Comment:
random.randset(n, k) returns a k length list of unique
integers in the range [0,n).
Improves on a Cookbook submission by using the
parameters to select between a shuffle algorithm and
a dictionary algorithm.
I want to add this to the library because it is a simple,
robust solution to a general selection problem and
because it isn't obvious that two different algorithms
are needed to balance speed/space trade-offs.
If approved, will add docs and a news item.
----------------------------------------------------------------------
>Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 07:37
Message:
Logged In: YES
user_id=80475
I use the routine for transaction testing in audit work.
The random order is useful so that subslices of the result
are also valid random samples. I run a sample of 60, test
the first 25, if an error is found, the sample expands to 60,
and if more errors are found, the transaction set does not
pass the audit.
The cookbook poster also needed the routine in his work
and wanted it badly enough to make an excrutiating
tranlation from old Fortran code from a textbook. To save
bungled re-inventions of the wheel, I crafted a cleaner
solution than either my quick and dirty or his translated
version.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2002-11-08 03:59
Message:
Logged In: YES
user_id=31435
Well, you're in murky waters because it's a "new feature"
patch rather than a bugfix, and wasn't vetted on Python-
Dev or c.l.py or via PEP first, nor is it a function in wide use
already, neither one that people have asked for in
the "small feature requests" PEP. It appeared out of the
blue, and "unsolicited"/undiscussed new features are
*usually* hard sells. The alternative is boundless bloat.
Python went for years without random.shuffle(), and that
got added because (a) at any given moment, you were
likely to find a c.l.py discussion about someone's incorrect
Python code for shuffling; and, (b) how to shuffle was a very
popular FAQ on the Tutor list. So the demand, and the
difficulty of rolling your own, were compellingly clear at the
time.
In contrast, people asking how to get a random k-
combination are almost conspicuous by absence, which
makes the "very common use" claim hard to buy when
viewed against the Python community as a whole.. The
handful (an exaggeration -- I only wish there were 5 <wink>)
who egged me into writing CombGen.py at the time wanted
much more than *just* that, and CombGen tried to meet all
expressed desires at the time. I have to agree with Martin
that people who would use this also want a lot of related
stuff (I'm one of them).
Some of the design decisions here remain unclear. Where
CombGen went out of its way to guarantee that
combinations are always delivered in "ascending" order, you
seem to want to guarantee that they appear in a random
order. Why? Especially since you view these as index
vectors, ascending order gives the best shot at locality of
reference when the user does the indirect indexing bit.
People who intend to use the result as a random starting
point into the lexicographic or Gray code ordering of k-
combinations also need ascending order.
CombGen never went into the std library because I never
made an attempt to put it there: CombGen never
attracted a signficant audience, and I'm not keen to push
things into the library that, as far as I can tell, only a few
people use.
Since that's the std I hold myself to, it's also the std I'm
inclined to hold others to.
In the absence of being able to point to potential users from
c.l.py threads, let me ask why *you* wrote it. Did you have
an actual app that needed this function (and if so, what was
it), or was it more of an interesting programming exercise?
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-11-08 01:52
Message:
Logged In: YES
user_id=21627
Well, I agree that the patch is correct in the sense of
doing what it says it does. What I cannot judge is whether
the feature is useful; it looks like bloat to me.
I could be convinced if you find a user of this function (or
the Cookbook recipe) who says I use it for this and that,
and I would prefer to see it in the library for that reason,
instead of copying it from the Cookbook.
I have the feeling that anybody who would use such a
function would also use ten other "standard" functions which
are not included in the library at the moment. So that
person would not be helped with getting the single function;
he would need an entirely new library of such things.
So I would propose that you withdraw the patch.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 01:36
Message:
Logged In: YES
user_id=80475
I'm re-learning to hate the patch process. This was a
straight-forward, thoroughy tested, useful patch. Getting it
accepted wasn't supposed to be hard.
What is the next step -- Take it as is, convert the n
argument to choice() style population list, or withdraw the
patch?
----------------------------------------------------------------------
Comment By: Guido van Rossum (gvanrossum)
Date: 2002-11-07 19:55
Message:
Logged In: YES
user_id=6380
I'm not even looking at this, I'm delegating this to Tim. He
knows infinitely more about random and permutations than I
do, and he's actually used this stuff.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-07 18:30
Message:
Logged In: YES
user_id=80475
Assigned to GvR for pronouncement on
a) whether he agrees that a sampling function is useful.
b) whether to implement it as is or with sequence
arguments
c) whether to leave it in random or put in another module.
The current form returns a list of integers that can be used
directly or as indices into a sequence. The advantages are
flexibility in use and the ability to pick a hundred elements
out of ten million without building a long list first. The
approach is essentially a uniquified list of calls to
randrange(). Tim prefers an approach that parallels
random.choice() where the call looks like:
random.sample([a,b,c,d,e], 2) # picks 2 of the 5 objects
I think the function belongs in the random module since it
is a primary use of random numbers (just like shuffle()
and choose()). Tim prefers to have a separate library
module that has a whole grab bag of combinatorics.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 16:38
Message:
Logged In: YES
user_id=80475
Thanks for the quick follow-ups.
The switchover ratio of six came from counting pointers
and longs. Shuffling uses an n length list at one pointer
for each element. The dictionary approach has k
elements with a hash code, a key pointer, and a value
pointer for a total of three multiplied by 1.5 and rounding
up to five (because dict loading is kept under 2/3) and one
pointer for the 'inorder' return list for a total of six. Also, I
liked six to minimize resampling in the dictionary
approach (keeping it under 20%).
As requested, I'll add the random argument to the
documentation.
Originally, I was going to have sample() select from an
arbitrary collection (like choose() does) but, in the end,
preferred the current approach of choosing integers. This
approach allows sample(1000000,60) without building a
giant list first. Also, converting from indices to elements is
trivial: [colorlist[i] for i in random.sample(len
(colorlist),5)].
I avoided the n/2 complement selection technique
because of use case rarity and to allow the sample itself to
be in random order (oxymoron?). If you guys think it's
necessary, I'll add a complement selection branch
followed by a call to random.shuffle(). Still, as it stands,
the code is robust, uses space no larger than a k sized
dictionary, and runs with no more than 1.2*k calls to
random().
I don't know why CombGen.py never made it to
Tools/scripts. Even if it does, I think a random sampling
function belongs in the random module where people can
find it -- it is a very common use of random numbers.
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2002-11-05 13:31
Message:
Logged In: YES
user_id=31435
I agree this is useful, but would rather see Python grow
libraries for combinatorial objects. There are many things
beyond this that are also useful, For example, the examples
you gave here were of selections from collections that aren't
range(n), and it would be more useful to more people to have
a way to choose k elements from an arbitrary n-element
collection directly (like a collection of transactions, or a set of
cards, whatever).
Note that I posted a module to Python-Dev not long ago that
implements such stuff (CombGen.py), along with other useful
functions on combinations.
Note that when k > n/2, "the usual trick" isn't to shuffle a list,
but to generate a complement selection. For example, if you
want a random sample of 9999 out of 10000, it's a lot more
efficient to pick the single element that's *not* in the result.
See CombGen for code to do this.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-11-05 12:34
Message:
Logged In: YES
user_id=21627
Thanks for the explanation. On to the implementation: How
did you arrive at the factor of 6 between a dictionary and a
list?
The documentation should mention the random optional argument.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 10:36
Message:
Logged In: YES
user_id=80475
Like shuffle() and choose(), random sampling without
replacement is one of the core principal use cases for
random numbers.
Acceptance testing often requires a fixed number of non-
overlapping samples i.e. Selecting 60 transactions out of
a 1000 and finding zero errors yields a 95% confidence
that the population has less than a 5% error rate.
Some simulations also need groups of non-overlapping
samples i.e. a lottery result of six unique numbers
selected from a range of 1 to 57. An electronic raffle picks
consecutive winners without allowing previous winners to
be reselected.
While sampling with replacement is trivial to implement
with a list comprehension, sampling without replacement
has a number of implementation nuances that makes it
worthwhile to have a robust solution already implemented
in the random library.
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-11-05 03:27
Message:
Logged In: YES
user_id=21627
Can you explain why this needs to be in the standard
library? I.e. what typical application would use it?
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 01:33
Message:
Logged In: YES
user_id=80475
Martin, do you have time to give this patch a second
review?
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-31 02:29
Message:
Logged In: YES
user_id=80475
Added new version with local variable optimization and
with the dictionary results returned in selection order.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 23:42
Message:
Logged In: YES
user_id=80475
Renamed to random.sample(n,k) to show that it is used
for sampling without replacement.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 23:42
Message:
Logged In: YES
user_id=80475
Renamed to random.sample(n,k) to show that it is used
for sampling without replacement.
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 07:54
Message:
Logged In: YES
user_id=80475
Added full patch with news item and docs.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=629637&group_id=5470