[Numpy-discussion] help creating a reversed cumulative histogram

Thu Sep 3 10:17:14 EDT 2009

On Thu, Sep 3, 2009 at 9:23 AM, Tim Michelsen
<timmichelsen at gmx-topmail.de> wrote:
>>
> Hello,
> I have checked the snippets you proposed.
> It does what I wanted to achieve.
> Obviously, I had to subtract the values as Robert
> demonstrated. This could also be perceived from
> the figure I posted.
>
> I still have to see how I can optimise the code
> (cf. below) or modify it to be less complicated.
> It seemed so simple in the spreadsheet...
>
>> eisf_sums = ecdf_sums[-1] - ecdf_sums
>> # empirical inverse survival

this should not have "inverse" in it; that was a cut-and-paste error

empirical survival function would be just 1-ecdf

however, as distributions they would need to be normalized to 1.
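
As a minimal sketch of that relation (the sample data and the names
`ecdf` and `esf` are made up for illustration, they are not from the
original script), the counts can be normalized so that the ecdf ends at 1
and the empirical survival function is then just 1 - ecdf:

```python
import numpy as np

# Hypothetical sample data, only for illustration.
values = np.array([1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 7.0, 9.0])

counts, edges = np.histogram(values, bins=4)

# ECDF evaluated at the bin edges: 0 at the left-most edge,
# 1 at the right-most edge after dividing by the total count.
ecdf = np.concatenate([[0.0], counts.cumsum()]) / counts.sum()

# Empirical survival function: fraction of observations at or above each edge.
esf = 1.0 - ecdf

for edge, s in zip(edges, esf):
    print(edge, s)
```

Here the cumulative counts start at 0, so `esf` starts at 1 and falls to 0
at the last bin edge.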

>> function of weights
> Can you recommend me a (literature) source where
> I can look up this term?
> I learned statistics in my mother tongue and seem
> to need a refresher on distributions...
> I would like to come up with the right terms
> next time.

My first stop is usually Wikipedia:

http://en.wikipedia.org/wiki/Survival_function
http://de.wikipedia.org/wiki/Verteilungsfunktion#.C3.9Cberlebenswahrscheinlichkeit

and the International Statistical Institute (ISI) glossary, which lists
terms in different languages:
http://isi.cbs.nl/glossary/bloken83.htm

>
>> Are you sure you want cumulative weights in
>>the histogram?
> You mean it doesn't make sense at all?

It depends on what you want. The ecdf as it is calculated, with the
weights argument in the histogram, gives you the cumulative sum of the
values, not the count.
In the case of the weight of pigs, it would be the cumulative weight of
all pigs with a weight less than the given bin boundary weight.
If the values were income, then it would be the aggregated income of all
individuals with an income below the bin boundary.
So it makes sense, given this is what you want (below).
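
To make the difference concrete, here is a small sketch with made-up pig
weights (the data and variable names are assumptions for illustration
only). Without `weights` the histogram counts the pigs per bin; with
`weights=pig_weights` it adds up their weights per bin:

```python
import numpy as np

# Hypothetical pig weights in kg, only for illustration.
pig_weights = np.array([50.0, 60.0, 65.0, 80.0, 95.0, 110.0])
bins = 3

# Number of pigs falling into each bin.
counts, edges = np.histogram(pig_weights, bins=bins)

# Total weight of the pigs falling into each bin.
sums, _ = np.histogram(pig_weights, bins=bins, weights=pig_weights)

# Cumulative versions: number of pigs, and their combined weight,
# up to each upper bin boundary.
cum_counts = counts.cumsum()
cum_sums = sums.cumsum()

print(edges)
print(cum_counts)
print(cum_sums)
```

The last entry of `cum_counts` is the number of pigs, while the last entry
of `cum_sums` is the combined weight of all pigs.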

>
> I need:
> 1) the count of occurrences sorted in each bin
>    counts = np.histogram(values,
>                                    normed=normed,
>                                    bins=bins)
>    => here I obtain now the same as in the
>    spreadsheet
>
> 2) the sum of all values sorted in each bin
>    sums = np.histogram(values, weights=values,
>                                    normed=normed,
>                                    bins=bins)
>
>    => here I still obtain different values for the first
>    histogram value (eisf_sums[0]):
>    Numpy: eisf_sums
>    335.50026738, 319.21363636, 266.07724942,
>    198.10258741, 126.69270396, 67.98125874,
>    38.47335664,  24.75062937, 13.42121212,
>    2.48636364, 0.
>
>    Spreadsheet:
>    335.2351159, 319.2136364, 266.0772494,
>    198.1025874, 126.692704, 67.98125874,
>    38.47335664, 24.75062937, 13.42121212,
>    2.486363636, 0

There might be a mistake in the treatment of a cell when
reversing. When I run your example, the highest value is
not equal to values.sum().

This might match the spreadsheet, but I haven't compared:
isf = sums[0][::-1].cumsum()[::-1]

But I'm not sure yet what's going on.
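
One check that does not need the spreadsheet is sketched below, again
with made-up data and illustrative names. It compares the reversed
cumulative sum with "total minus cumulative sum" when the cumulative sum
starts at 0.0. Whether this accounts for the difference to the
spreadsheet is not established here:

```python
import numpy as np

# Hypothetical sample data, only for illustration.
values = np.array([1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 7.0, 9.0])

sums, edges = np.histogram(values, bins=4, weights=values)

# Reverse, accumulate, reverse back:
# entry i is the sum of the values in bins i, i+1, ..., last.
isf = sums[::-1].cumsum()[::-1]

# The same quantity at every bin edge, as total minus the cumulative sum.
# The cumulative sum starts at 0.0 here; starting it at 1.0 instead would
# make the first entry equal to the total minus 1.0.
cum = np.concatenate([[0.0], sums.cumsum()])
isf_edges = cum[-1] - cum

print(isf)
print(isf_edges)
```

In this sketch the first entry of both arrays equals `values.sum()`, and
`isf_edges` has one extra trailing 0 for the right-most bin edge.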

Josef

>
> Additionally, I would like to see these implemented
> as convenience functions in numpy or scipy.
> There should be out of the box functions for all kinds
> of distributions.
> Where is the best place to contribute a final version?
> The scipy.stats?
>
> Thanks again for your input,
> Timmie
>
> ##### below the distilled code #####
> ## histogram settings
> normed = False
> bins = 10
>
> ## counts: gives expected results
> counts = np.histogram(values,
>                                    normed=normed,
>                                    bins=bins)
>
> ecdf_counts = np.hstack([1.0, counts[0].cumsum() ])
> ecdf_inv_counts = ecdf_counts[::-1]
> # empirical inverse survival function of weights
> eisf_counts = ecdf_counts[-1] - ecdf_counts
>
>
> ### sum: does have deviations
> sums = np.histogram(values, weights=values,
>                                    normed=normed,
>                                    bins=bins)
> ecdf_sums = np.hstack([1.0, sums[0].cumsum() ])
> ecdf_inv_sums = ecdf_sums[::-1]
> # empirical inverse survival function of weights
> eisf_sums = ecdf_sums[-1] - ecdf_sums
>
> ##
> # configure plot
> xlabel = 'Bins'
> ylabel_left = 'Counts'
> ylabel_right = 'Sum'
>
>
> fig1 = plt.figure()
> ax1 = fig1.add_subplot(111)
>
> # counts
> ax1.plot(counts[1], ecdf_inv_counts, 'r-')
> ax1.set_xlabel(xlabel)
> ax1.set_ylabel(ylabel_left, color='b')
> for tl in ax1.get_yticklabels():
>    tl.set_color('b')
>
> # sums
> ax2 = ax1.twinx()
> ax2.plot(sums[1], eisf_sums, 'b-')
> ax2.set_ylabel(ylabel_right, color='r')
> for tl in ax2.get_yticklabels():
>    tl.set_color('r')
> plt.show()