[Numpy-discussion] Changing the return type of np.histogramdd

Ralf Gommers ralf.gommers at gmail.com
Sat Apr 28 01:25:36 EDT 2018


On Wed, Apr 25, 2018 at 11:00 PM, Eric Wieser <wieser.eric+numpy at gmail.com>
wrote:

> For precision loss of the order of float64 eps, I disagree.
>
> I was thinking more about precision loss on the order of 1, for large
> 64-bit integers that can’t fit in a float64
>
It's late and I'm probably missing something, but:

>>> np.iinfo(np.int64).max > np.finfo(np.float64).max
False

Either way, such weights don't really happen in real code I think.
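
To separate the two questions: the comparison above is about range, while
the concern quoted is about precision. A minimal sketch of the latter,
assuming an integer weight gets converted to float64 at some intermediate
step and back again (the variable names are only illustrative):

```python
import numpy as np

# float64 has a 53-bit significand, so not every integer above 2**53
# can be represented exactly.
weight = np.int64(2**53 + 1)

# Assumed intermediate step: the weight is converted to float64 ...
as_float = np.float64(weight)

# ... and converted back to an integer afterwards.
round_trip = np.int64(as_float)

# Difference between the original and the round-tripped weight.
lost = int(weight) - int(round_trip)
print(lost)
```

The weight is well within the range of float64, yet it does not survive the
round trip: it comes back off by one.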


> Note also that #10864 <https://github.com/numpy/numpy/issues/10864>
> incurs deliberate precision loss of the order 10**-6 x smallest bin, which
> is also much larger than eps.
>
Yeah that's worse.


> It’s also possible to refer users to scipy.stats.binned_statistic
>
> That sounds like a good idea to do irrespective of whether histogramdd has
> problems - I had no idea those existed. Is there a precedent for referring
> to more feature-rich scipy functions from the basic numpy ones?
>
Yes, there are cross-links to Python, SciPy and Matplotlib functions in the
docs. This is done with intersphinx (
https://github.com/numpy/numpy/blob/master/doc/source/conf.py#L215).
Example cross-link for convolve:
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.convolve.html
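
For illustration, an intersphinx setup boils down to something like the
following in a Sphinx conf.py. This is a sketch with assumed project keys
and URLs, not a copy of what NumPy's conf.py contains:

```python
# Hypothetical excerpt of a Sphinx conf.py; the keys and URLs below are
# illustrative assumptions, not NumPy's actual configuration.

# Enable the extension that resolves references into other projects' docs.
extensions = ["sphinx.ext.intersphinx"]

# Map a short project name to (base URL of its docs, inventory location).
# None means "fetch objects.inv from the base URL".
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    "scipy": ("https://docs.scipy.org/doc/scipy/", None),
    "matplotlib": ("https://matplotlib.org/stable/", None),
}
```

With a mapping like this, a reference such as
:func:`scipy.stats.binned_statistic_dd` in a "See Also" section should
resolve to the SciPy docs when the NumPy docs are built.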

Ralf



>
> On Wed, 25 Apr 2018 at 22:51 Ralf Gommers <ralf.gommers at gmail.com> wrote:
>
>> On Wed, Apr 25, 2018 at 10:07 PM, Eric Wieser <
>> wieser.eric+numpy at gmail.com> wrote:
>>
>>> what does that gain over having the user do something like
>>> result.astype()
>>>
>>> It means that the user can use integer weights without worrying about
>>> losing precision due to an intermediate float representation.
>>>
>>> It also means they can use higher precision values (np.longdouble) or
>>> complex weights.
>>>
>> None of that seems particularly important to be honest.
>>
>>> you’re emitting warnings for everyone
>>>
>>> When there’s a risk of precision loss, that seems like the responsible
>>> thing to do.
>>>
>> For precision loss of the order of float64 eps, I disagree. There will be
>> many such places in numpy and in other core libraries.
>>
>>
>>> Users passing float weights would see no warning, I suppose.
>>>
>>> is this really worth a new function
>>>
>>> There ought to be a function for computing histograms with integer
>>> weights that doesn’t lose precision. Either we change the existing function
>>> to do that, or we make a new function.
>>>
>> It's also possible to refer users to scipy.stats.binned_statistic(_2d/dd),
>> which provides a superset of the histogram functionality and is internally
>> consistent because the implementations of 1d/2d call the dd one.
>>
>> Ralf
>>
>>
>>
>>> A possible compromise: like 1, but only change the dtype of the result
>>> if a weights argument is passed.
>>>
>>> #10864 <https://github.com/numpy/numpy/issues/10864> seems like a
>>> worrying design flaw too, but I suppose that can be dealt with separately.
>>>
>>> Eric
>>>
>>> On Wed, 25 Apr 2018 at 21:57 Ralf Gommers <ralf.gommers at gmail.com>
>>> wrote:
>>>
>>>> On Mon, Apr 9, 2018 at 10:24 PM, Eric Wieser <
>>>> wieser.eric+numpy at gmail.com> wrote:
>>>>
>>>>> Numpy has three histogram functions - histogram, histogram2d, and
>>>>> histogramdd.
>>>>>
>>>>> histogram is by far the most widely used, and in the absence of
>>>>> weights and normalization, returns an np.intp count for each bin.
>>>>>
>>>>> histogramdd (for which histogram2d is a wrapper) returns np.float64
>>>>> in all circumstances.
>>>>>
>>>>> As a contrived comparison
>>>>>
>>>>> >>> x = np.linspace(0, 1)
>>>>> >>> h, e = np.histogram(x*x, bins=4); h
>>>>> array([25, 10,  8,  7], dtype=int64)
>>>>> >>> h, e = np.histogramdd((x*x,), bins=4); h
>>>>> array([25., 10.,  8.,  7.])
>>>>>
>>>>> https://github.com/numpy/numpy/issues/7845 tracks this inconsistency.
>>>>>
>>>>> The fix is now trivial: the question is, will changing the return type
>>>>> break people’s code?
>>>>>
>>>>> Either we should:
>>>>>
>>>>>    1. Just change it, and hope no one is broken by it
>>>>>    2. Add a dtype argument:
>>>>>       - If dtype=None, behave like np.histogram
>>>>>       - If dtype is not specified, emit a future warning recommending
>>>>>       to use dtype=None or dtype=float
>>>>>       - In future, change the default to None
>>>>>    3. Create a new better-named function histogram_nd, which can also
>>>>>    be created without the mistake that is
>>>>>    https://github.com/numpy/numpy/issues/10864.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>
>>>> (1) seems like a no-go; taking such risks isn't justified by a minor
>>>> inconsistency.
>>>>
>>>> (2) is still fairly intrusive: you're emitting warnings for everyone
>>>> and still forcing people to change their code (and if they don't, they
>>>> may run into a backwards compat break).
>>>>
>>>> (3) is the best of these options; however, is this really worth a new
>>>> function? My vote would be "do nothing".
>>>>
>>>> Ralf
>>>>
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion at python.org
>>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>>
>>>