[Python-ideas] random.sample should work better with iterators

Tue Jun 26 23:07:58 EDT 2018

Steven D'Aprano writes:

 > > I don't know if Python Ideas is the right channel for this, but this seems 
 > > overly constrained. The inability to handle dictionaries is especially 
 > > puzzling.
 > 
 > Puzzling in what way?

Same misconception, I suppose.

 > If sample() supported dicts, should it return the keys or the values or 
 > both?

I argue below that *if* we were going to make the change, it should be
to consistently try list() on non-sequences.  But "not every
one-liner needs to be in the stdlib" and EIBTI ("explicit is better
than implicit") both apply:

>>> from random import sample
>>> d = {'a': 1, 'b': 2}
>>> sample(d.keys(), 1)
['a']
>>> sample(d.items(), 1)
[('a', 1)]

But this is weird:

>>> sample(d.values(), 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/random.py", line 314, in sample
    raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
TypeError: Population must be a sequence or set.  For dicts, use list(d).

Oh, I see.  Key views are "set-like", item views *may* be set-like
(when all the values are hashable), but value views are *not*
set-like, because values are generally not unique.

Since views are all listable, why not try "list" on them?  In general,
I would think it makes sense to define this as "Population must be a
sequence or convertible to a sequence using list()."  And for most of
the applications I can think of in my own use, sample(list(d)) is not
particularly useful because it's a sample of keys.  I usually want
sample(list(d.values())).
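
To make the "try list()" idea concrete, here is a minimal sketch of the
behavior I have in mind.  The name sample_any is made up for
illustration; it is not part of the random module, and this is not a
claim about how sample() should be implemented internally.

```python
import random
from collections.abc import Sequence

def sample_any(population, k, rng=random):
    """Hypothetical helper: sample k items, listing non-sequences first."""
    if not isinstance(population, Sequence):
        # O(n) copy; works for dict views, sets, and one-shot iterators.
        population = list(population)
    return rng.sample(population, k)

d = {'a': 1, 'b': 2, 'c': 3}
print(sample_any(d.values(), 2))   # two of the values
print(sample_any(d, 1))            # a dict iterates over its keys
```

Note that the explicit copy is exactly the cost that the one-liner
sample(list(d.values()), k) makes visible at the call site.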

The ramifications are unclear to me, but I guess it's too late to
change this because of the efficiency implications Tim describes in
issue33098 (so EIBTI; thanks for the reference!).  On the other hand,
that issue says sets can't be sampled efficiently, so the current
behavior seems to *promote* inefficient usage?

I would definitely change the error message.  I think "use list(d)" is
bad advice because I believe it's not even "almost always" what you'll
want.  Worse, if keys and values are of the same type, the output
won't make it obvious that you're getting a sample of the keys rather
than the sample from d.values() that you wanted and thought you were
getting.
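
A small illustration of that trap, using a made-up mapping whose keys
and values are both ints in the same range (the name succ and its
contents are hypothetical):

```python
import random

# Hypothetical mapping whose keys and values have the same type and range.
succ = {1: 2, 2: 3, 3: 1}

from_keys = random.sample(list(succ), 2)             # what "use list(d)" gives
from_values = random.sample(list(succ.values()), 2)  # what you may have wanted

# Both are lists of two distinct ints drawn from {1, 2, 3}; nothing in
# the output says which population was actually sampled.
print(from_keys, from_values)
```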

 > Don't let the source speak for itself. Explain what it means. I 
 > understand what sample(population, size=100) does. What would 
 > sample(population, ratio=0.25) do?

I assume sample(pop, ratio=0.25) == sample(pop, size=int(0.25*len(pop))).
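
Since the sample size has to be an integer, that reading needs a
rounding rule.  A minimal sketch, assuming truncation toward zero; the
function name sample_ratio and the choice of int() over round() are my
assumptions, not anything that has been proposed in detail:

```python
import random

def sample_ratio(population, ratio, rng=random):
    """Hypothetical: sample a fixed fraction of a sequence."""
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be between 0 and 1")
    # Truncation is an assumption; round() or math.ceil() are alternatives.
    k = int(ratio * len(population))
    return rng.sample(population, k)

pop = list(range(20))
print(len(sample_ratio(pop, 0.25)))   # 5
```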