[Numpy-discussion] np.histogram: upper range bin

Peter Butterworth butterw at gmail.com
Sun Jun 12 16:46:48 EDT 2011


Consistent bin width is important for my applications. With floating
point numbers I usually shift my bins by a small offset to ensure that
values at bin edges always fall in the correct bin.
With the current np.histogram behavior you _silently_ get a wrong
count in the top bin if a value falls on the upper bin limit.
Incidentally, this happens by default with integers, because the
default range is (min(x), max(x)). Example: x = range(4);
np.histogram(x)

I guess it is better to always specify the correct range, but wouldn't
it be preferable if the function issued a warning when this case
occurs?

Likely I will test for the following condition when using np.histogram(x):
max(x) == top bin limit
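
A minimal sketch of that check, assuming a small wrapper around
np.histogram. The name histogram_with_edge_check and its value_range
parameter are hypothetical and not part of NumPy; the wrapper only
reports how many values sit exactly on the top edge, and what to do
about them is left to the caller.

```python
import numpy as np

def histogram_with_edge_check(x, bins=10, value_range=None):
    """Hypothetical helper: histogram plus a count of values on the top edge."""
    x = np.asarray(x)
    counts, edges = np.histogram(x, bins=bins, range=value_range)
    # Values equal to the last edge are counted in the top bin,
    # because that bin is closed on both sides.
    n_on_top_edge = int(np.count_nonzero(x == edges[-1]))
    return counts, edges, n_on_top_edge

counts, edges, n_on_top_edge = histogram_with_edge_check(range(4))
if n_on_top_edge:
    print(f"{n_on_top_edge} value(s) lie exactly on the top edge {edges[-1]}")
```

With the default range the top edge is max(x), so this check fires for
any non-empty input; it only becomes informative once an explicit range
is passed.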

---
Re: [Numpy-discussion] np.histogram: upper range bin
Christopher Barker
Thu, 02 Jun 2011 09:19:16 -0700

Peter Butterworth wrote:
> in np.histogram the top-most bin edge is inclusive of the upper range
> limit. As documented in the docstring (see below) this is actually the
> expected behavior, but this can lead to some weird enough results:
>
> In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3)
> Out[72]: (array([1, 1, 2]), array([ 1.,  2.,  3., 4.]))
>
> Is there any way round this or an alternative implementation without
> this issue ?

The way around it is what you've identified -- making sure your bins are
right. But I think the current behavior is the way it "should" be. It
keeps folks from inadvertently losing stuff off the end -- the lower
end is inclusive, so the upper end should be too. In the middle bins,
one has to make an arbitrary cut-off, and put the values on the "line"
somewhere.
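
To make the edge convention concrete, here is a rough sketch that
recounts the example by hand: every bin is treated as half-open
[lo, hi), and values equal to the last edge are then added to the top
bin. This is only an illustration of the documented behavior, not how
np.histogram is implemented internally.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
counts, edges = np.histogram(x, bins=3)  # edges are 1, 2, 3, 4

# Count each bin as half-open: lo <= value < hi
manual = [
    int(np.count_nonzero((x >= lo) & (x < hi)))
    for lo, hi in zip(edges[:-1], edges[1:])
]
# The top bin is also closed on the right, so add values on the last edge
manual[-1] += int(np.count_nonzero(x == edges[-1]))

print(counts.tolist(), manual)
```

Both counts agree: the value 4 sits on the last edge and joins 3 in the
top bin, which is where the 2 in the output comes from.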


One thing to keep in mind is that, in general, histogram is designed for
floating point numbers, not just integers -- counting integers can be
accomplished other ways, if that's what you really want (see
np.bincount). But back to your example:

 > In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3)

Why do you want only 3 bins here? Using 4 gives you what you want. If
you want more control, then it seems you really want to know how many of
each of the values 1, 2, 3, 4 there are. So you want 4 bins, each
*centered* on the integers, so you might do:

In [8]: np.histogram(x, bins=4, range=(0.5, 4.5))
Out[8]: (array([1, 1, 1, 1]), array([ 0.5,  1.5,  2.5,  3.5,  4.5]))

or, if you want to be more explicit:

In [14]: np.histogram(x, bins=np.linspace(0.5, 4.5, 5))
Out[14]: (array([1, 1, 1, 1]), array([ 0.5,  1.5,  2.5,  3.5,  4.5]))
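
For purely integer data, the np.bincount route mentioned above might
look like the following sketch. It assumes the values are small
non-negative integers; np.unique with return_counts=True is shown as an
alternative that does not need that assumption.

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# bincount returns one count per integer from 0 up to max(x)
counts = np.bincount(x)

# unique reports only the values that actually occur, with their counts
values, value_counts = np.unique(x, return_counts=True)

print(counts.tolist())
print(values.tolist(), value_counts.tolist())
```

Neither call involves bin edges, so the question of which bin an edge
value belongs to does not come up.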


HTH,

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception


