choose value from custom distribution

Tue Oct 19 02:40:56 EDT 2010

elsa <kerensaelise at hotmail.com> writes:

> Hello,
>
> I'm trying to find a way to collect a set of values from real data,
> and then sample values randomly from this data - so, the data I'm
> collecting becomes a kind of probability distribution. For instance, I
> might have age data for some children. It's very easy to collect this
> data using a list, where the index gives the value of the data, and
> the number in the list gives the number of times that value occurs:
>
> [0,0,10,20,5]
>
> could mean that there are no people aged 0, no
> people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
> in my data collection.
>
> I then want to make a random sample that would be representative of
> these proportions - is there any easy and fast way to select an entry
> weighted by its value? Or are there any python packages that allow you
> to easily create your own distribution based on collected data? Two
> other things to bear in mind are that in reality I'm collating data
> from up to around 5 million individuals, so just making one long list
> with a new entry for each individual won't work. Also, it would be
> good if I didn't have to decide beforehand what the possible range of
> values is (which unfortunately I have to do with the approach I'm
> currently working on).
>
> Thanks in advance for your help,
>
> elsa.

If you want to keep it simple, you can do:

>>> import random
>>> t = [0,0,10,20,5]
>>> # repeat each value x (the index) f times (its count)
>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>> random.sample(expanded, 10)
[3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>> random.sample(expanded, 10)
[3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>> random.sample(expanded, 10)
[3, 3, 3, 3, 3, 2, 3, 2, 2, 3]
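
If the expanded list is too large to build, one possible alternative is to keep only the counts and draw from their cumulative sums. The following is a minimal sketch, not a definitive implementation: make_sampler and draw are hypothetical names, and keeping the counts in a collections.Counter is an assumption (it means the range of values need not be fixed in advance). Unlike random.sample above, it draws with replacement.

```python
import bisect
import itertools
import random
from collections import Counter

def make_sampler(counts):
    """Return a function drawing values in proportion to their counts.

    `counts` maps each observed value to the number of times it occurred.
    """
    # Ignore values that never occurred, and fix an order for the rest.
    values = sorted(v for v, c in counts.items() if c > 0)
    # Running totals: for counts 10, 20, 5 this is [10, 30, 35].
    cumulative = list(itertools.accumulate(counts[v] for v in values))
    total = cumulative[-1]

    def draw(k=1, rng=random):
        # Pick an integer in [0, total) and find which value's slice
        # of the running totals it falls into.
        return [values[bisect.bisect_right(cumulative, rng.randrange(total))]
                for _ in range(k)]

    return draw

# Same data as [0,0,10,20,5], stored as value -> count.
ages = Counter({2: 10, 3: 20, 4: 5})
draw = make_sampler(ages)
sample = draw(10)
```

The memory used depends on the number of distinct values rather than the number of individuals, and each draw is a binary search over the running totals. New observations can be added to the Counter as they arrive, but the sampler has to be rebuilt afterwards.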

Is that what you need?

-- 
Arnaud