[Numpy-discussion] Changing the return type of np.histogramdd

Eric Wieser wieser.eric+numpy at gmail.com
Sat Apr 28 02:38:35 EDT 2018


> It’s late and I’m probably missing something

The issue is not one of range as you showed there, but of precision. Here’s
the test case you’re missing:

import numpy as np

def get_err(u64):
    """ Return the absolute error incurred by storing a uint64 in a float64. """
    u64 = np.uint64(u64)
    return u64 - u64.astype(np.float64).astype(np.uint64)

The problem starts appearing with

>>> get_err(2**53 + 1)
1

and only gets worse as the size of the integers increases

>>> get_err(2**64 - 2*10)
9223372036854775788  # this is a lot bigger than float64 eps (although as a relative error, it's similar)
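In case the 2**53 cutoff looks arbitrary, here's a minimal sketch of where it comes from. float64 has a 53-bit significand, so 2**53 + 1 is the first integer it cannot represent, and the gap between adjacent representable values keeps doubling from there:

```python
import numpy as np

# float64 carries a 53-bit significand, so every integer up to 2**53
# round-trips exactly; 2**53 + 1 is the first one that does not.
print(np.float64(2**53) == np.float64(2**53 + 1))   # True: both collapse to 2**53

# Near the top of the uint64 range the gap between adjacent float64
# values grows to 2048, which is where the huge absolute errors come from.
print(np.spacing(np.float64(2**63)))                # 2048.0
```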

> Either way, such weights don’t really happen in real code I think.

The counterexample I can think of is someone trying to implement
fixed-precision arithmetic with large integers. The intersection of people
doing both that and histogramdd is probably very small, but it’s at least
plausible.
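To make that fixed-precision scenario concrete, here's a minimal sketch (the tick values are made up) of how integer weights silently degrade once they pass through a float64 intermediate, which is what a float64-accumulating histogram would do:

```python
import numpy as np

# Hypothetical fixed-point scenario: quantities stored as large uint64 "ticks".
# Summing through a float64 intermediate silently rounds once the running
# total exceeds 2**53.
ticks = np.array([2**62, 2**62, 3], dtype=np.uint64)

exact_total = int(ticks.sum())                     # uint64 accumulation, exact
float_total = int(ticks.astype(np.float64).sum())  # float64 intermediate

print(exact_total - float_total)  # 3: the small addend is rounded away
```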

> Yes, there are cross-links to Python, SciPy and Matplotlib functions in the
> docs.

Great, that was what I was unsure of. I was worried that linking to
upstream projects would be sort of weird, but practicality beats purity for
sure here.
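For anyone else who was unsure how those cross-links work: the intersphinx mechanism Ralf points at boils down to a mapping in the Sphinx `conf.py`. A rough sketch (the mapping entries below are illustrative, not copied from NumPy's actual config):

```python
# Sketch of an intersphinx setup in a Sphinx conf.py.
# URLs and project keys are illustrative assumptions, not NumPy's exact config.
extensions = ['sphinx.ext.intersphinx']

intersphinx_mapping = {
    'python': ('https://docs.python.org/3', None),
    'scipy': ('https://docs.scipy.org/doc/scipy/', None),
    'matplotlib': ('https://matplotlib.org/stable/', None),
}

# A docstring can then cross-link an upstream function with a role such as:
#   :func:`scipy.stats.binned_statistic_dd`
```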

Eric

On Fri, 27 Apr 2018 at 22:26 Ralf Gommers <ralf.gommers at gmail.com> wrote:

> On Wed, Apr 25, 2018 at 11:00 PM, Eric Wieser <wieser.eric+numpy at gmail.com
> > wrote:
>
>> For precision loss of the order of float64 eps, I disagree.
>>
>> I was thinking more about precision loss on the order of 1, for large
>> 64-bit integers that can’t fit in a float64
>>
> It's late and I'm probably missing something, but:
>
> >>> np.iinfo(np.int64).max > np.finfo(np.float64).max
> False
>
> Either way, such weights don't really happen in real code I think.
>
>
>> Note also that #10864 <https://github.com/numpy/numpy/issues/10864>
>> incurs deliberate precision loss of the order 10**-6 x smallest bin, which
>> is also much larger than eps.
>>
> Yeah that's worse.
>
>
>> It’s also possible to refer users to scipy.stats.binned_statistic
>>
>> That sounds like a good idea to do irrespective of whether histogramdd
>> has problems - I had no idea those existed. Is there a precedent for
>> referring to more feature-rich scipy functions from the basic numpy ones?
>>
> Yes, there are cross-links to Python, SciPy and Matplotlib functions in
> the docs. This is done with intersphinx (
> https://github.com/numpy/numpy/blob/master/doc/source/conf.py#L215).
> Example cross-link for convolve:
> https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.convolve.html
>
> Ralf
>
>
>
>> On Wed, 25 Apr 2018 at 22:51 Ralf Gommers <ralf.gommers at gmail.com> wrote:
>>
>>> On Wed, Apr 25, 2018 at 10:07 PM, Eric Wieser <
>>> wieser.eric+numpy at gmail.com> wrote:
>>>
>>>> what does that gain over having the user do something like
>>>> result.astype()
>>>>
>>>> It means that the user can use integer weights without worrying about
>>>> losing precision due to an intermediate float representation.
>>>>
>>>> It also means they can use higher precision values (np.longdouble) or
>>>> complex weights.
>>>>
>>> None of that seems particularly important to be honest.
>>>
>>> you’re emitting warnings for everyone
>>>>
>>>> When there’s a risk of precision loss, that seems like the responsible
>>>> thing to do.
>>>>
>>> For precision loss of the order of float64 eps, I disagree. There will
>>> be many such places in numpy and in other core libraries.
>>>
>>>
>>>> Users passing float weights would see no warning, I suppose.
>>>>
>>>> is this really worth a new function
>>>>
>>>> There ought to be a function for computing histograms with integer
>>>> weights that doesn’t lose precision. Either we change the existing function
>>>> to do that, or we make a new function.
>>>>
>>> It's also possible to refer users to
>>> scipy.stats.binned_statistic(_2d/dd), which provides a superset of the
>>> histogram functionality and is internally consistent because the
>>> implementations of 1d/2d call the dd one.
>>>
>>> Ralf
>>>
>>>
>>>
>>>> A possible compromise: like 1, but only change the dtype of the result
>>>> if a weights argument is passed.
>>>>
>>>> #10864 <https://github.com/numpy/numpy/issues/10864> seems like a
>>>> worrying design flaw too, but I suppose that can be dealt with separately.
>>>>
>>>> Eric
>>>> On Wed, 25 Apr 2018 at 21:57 Ralf Gommers <ralf.gommers at gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Apr 9, 2018 at 10:24 PM, Eric Wieser <
>>>>> wieser.eric+numpy at gmail.com> wrote:
>>>>>
>>>>>> Numpy has three histogram functions - histogram, histogram2d, and
>>>>>> histogramdd.
>>>>>>
>>>>>> histogram is by far the most widely used, and in the absence of
>>>>>> weights and normalization, returns an np.intp count for each bin.
>>>>>>
>>>>>> histogramdd (for which histogram2d is a wrapper) returns np.float64
>>>>>> in all circumstances.
>>>>>>
>>>>>> As a contrived comparison
>>>>>>
>>>>>> >>> x = np.linspace(0, 1)
>>>>>> >>> h, e = np.histogram(x*x, bins=4); h
>>>>>> array([25, 10,  8,  7], dtype=int64)
>>>>>> >>> h, e = np.histogramdd((x*x,), bins=4); h
>>>>>> array([25., 10.,  8.,  7.])
>>>>>>
>>>>>> https://github.com/numpy/numpy/issues/7845 tracks this inconsistency.
>>>>>>
>>>>>> The fix is now trivial: the question is, will changing the return
>>>>>> type break people’s code?
>>>>>>
>>>>>> Either we should:
>>>>>>
>>>>>>    1. Just change it, and hope no one is broken by it
>>>>>>    2. Add a dtype argument:
>>>>>>       - If dtype=None, behave like np.histogram
>>>>>>       - If dtype is not specified, emit a future warning
>>>>>>       recommending to use dtype=None or dtype=float
>>>>>>       - In future, change the default to None
>>>>>>    3. Create a new better-named function histogram_nd, which can
>>>>>>    also be created without the mistake that is
>>>>>>    https://github.com/numpy/numpy/issues/10864.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>
>>>>> (1) seems like a no-go, taking such risks isn't justified by a minor
>>>>> inconsistency.
>>>>>
>>>>> (2) is still fairly intrusive, you're emitting warnings for everyone
>>>>> and still force people to change their code (and if they don't they may run
>>>>> into a backwards compat break).
>>>>>
>>>>> (3) is the best of these options, however is this really worth a new
>>>>> function? My vote would be "do nothing".
>>>>>
>>>>> Ralf
>>>>>
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list
>>>>> NumPy-Discussion at python.org
>>>>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>>>>
>>>>