random.choices() Suggest that the code confirm that cum_weights sequence is in ascending order

Paul Moore p.f.moore at gmail.com
Mon May 14 09:06:03 EDT 2018


On 14 May 2018 at 13:53, Chris Angelico <rosuav at gmail.com> wrote:
> On Mon, May 14, 2018 at 10:49 PM, Paul Moore <p.f.moore at gmail.com> wrote:
>> On 14 May 2018 at 13:27, Chris Angelico <rosuav at gmail.com> wrote:
>>> On Mon, May 14, 2018 at 9:59 PM, Paul Moore <p.f.moore at gmail.com> wrote:
>>>> The problem is that supplying cum_weights allows the code to run in
>>>> O(log n) by using bisection. This is significantly faster on large
>>>> populations. Adding a test that the cumulative weights are
>>>> nondecreasing would add an O(n) step to the code.
>>>>
>>>
>>> Hang on - are the 'n' and 'log n' there referring to the same n?
>>
>> Yes. The number of elements in the sample population (which is the
>> same as the number of entries in the weights/cum_weights arrays).
>
> Okay, cool. Thanks. I was a little confused as to whether the weights
> were getting grouped up or not. Have seen too many cases where someone
> panics about an O(n²) on a tiny n that's unrelated to the important
> O(n) on a huge n :)

Yeah, for all of *my* uses of the functions in random, n is so small
as to make all this irrelevant. But when I looked into how cum_weights
worked, I realised it's aimed at people passing significantly sized
data sets. And they would probably be hit hard by a change from
O(log n) to O(n).
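
For anyone curious, here's a rough sketch of the bisection approach,
alongside the O(n) check being proposed (this is illustrative, not the
actual stdlib source - the helper names are made up):

import bisect
import random

def pick_one(population, cum_weights):
    # One draw via binary search over the (assumed nondecreasing)
    # cumulative weights - O(log n) per draw.
    total = cum_weights[-1]
    x = random.random() * total          # uniform in [0, total)
    return population[bisect.bisect(cum_weights, x)]

def is_nondecreasing(cum_weights):
    # The proposed sanity check: a single O(n) pass. This is the
    # cost that would be paid on every call, swamping the O(log n)
    # lookup for large populations.
    return all(a <= b for a, b in zip(cum_weights, cum_weights[1:]))

population = list(range(1_000_000))
cum_weights = list(range(1, 1_000_001))  # cumulative weights 1..n
print(pick_one(population, cum_weights))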

One thing I always liked about C++ was the way the standard library
documented the big-O complexity of its operations. It not only made it
easier to know what was costly and what wasn't, but also made it much
clearer which functions were intended for use on large data sets. I
sort of miss that information in Python - not least because functions
like random.choices are often a lot faster than I'd naively expect.
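
As a quick illustration of that speed (timings are obviously
machine-dependent, so treat this as a sketch): precomputing
cum_weights lets choices() skip the per-call accumulation of the
weights entirely.

import random
import timeit
from itertools import accumulate

n = 100_000
population = list(range(n))
weights = [1] * n
cum_weights = list(accumulate(weights))

# With weights=..., choices() accumulates them internally on every
# call (O(n)); with cum_weights=... it goes straight to bisection.
t_w = timeit.timeit(
    lambda: random.choices(population, weights=weights, k=10),
    number=100)
t_c = timeit.timeit(
    lambda: random.choices(population, cum_weights=cum_weights, k=10),
    number=100)
print(f"weights:     {t_w:.3f}s")
print(f"cum_weights: {t_c:.3f}s")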

Paul


