[Numpy-discussion] Low-level API for Random

Sat Sep 21 00:47:49 EDT 2019

On Fri, Sep 20, 2019 at 9:31 PM Robert Kern <robert.kern at gmail.com> wrote:

> On Fri, Sep 20, 2019 at 11:33 PM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>
>>
>>
>> On Fri, Sep 20, 2019 at 7:09 AM Robert Kern <robert.kern at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Sep 20, 2019 at 6:09 AM Ralf Gommers <ralf.gommers at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 5:29 AM Robert Kern <robert.kern at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> We might end up with more than 2 implementations if we need to change
>>>>> something about the function signature, for whatever reason, and we want to
>>>>> retain C/Cython API compatibility with older code. The C functions aren't
>>>>> necessarily going to be one-to-one to the Generator methods. They're just
>>>>> part of the implementation. So for example, if we wanted to, say,
>>>>> precompute some intermediate values from the given scalar parameters so we
>>>>> don't have to recompute them for each element of the `size`-large requested
>>>>> output, we might do that in one C function and pass those intermediate
>>>>> values as arguments to the C function that does the actual sampling. So
>>>>> we'd have two C functions for that one Generator method, and the sampling C
>>>>> function will not have the same signature as it did before the modification
>>>>> that refactored the work into two functions. In that case, I would not be
>>>>> so strict as to require that `Generator.foo` is one to one with
>>>>> `random_foo`.
>>>>>
>>>>
>>>> You're saying "be so strict" as if it were a bad thing, or a major
>>>> effort.
>>>>
>>>
>>> I am. It's an unnecessary limitation on the C API without a
>>> corresponding benefit. Your original complaint
>>>
>>
>> It's not a "complaint".
>>
>
> Please forgive me. That word choice was not intended to be dismissive. I
> don't view "complaints" as minor annoyances that the "complainer" should
> just shut up and deal with, or that the "complainer" is just being
> annoying, but I can see how it came across that I might. Please continue as
> if I said "The problem you originally noted...". It's a real problem that
> needs to be addressed. We just have different thoughts on exactly what is
> needed to address it.
>

Okay, thank you:)

>
>> We're having this discussion because we shipped a partial API in 1.17.0
>> that we will now have to go back and either take out or clean up in 1.17.3.
>> The PR for the new numpy.random grew so large that we didn't notice or
>> discuss that (such things happen, no big deal - we have limited reviewer
>> bandwidth). So now that we do, it makes sense to actually think about what
>> needs to be in the API. For now I think that's only the parts that are
>> matching the Python API plus what is needed to use them from C/Cython.
>> Future additions require similar review and criteria as adding to the
>> Python API and the existing NumPy C API. To me, your example seems to (a)
>> not deal with API stability, and (b) expose too much implementation detail.
>>
>> To be clear about the actual status, we:
>> - shipped one header file (bitgen.h)
>> - shipped two pxd files (common.pxd, bit_generator.pxd)
>> - removed a header file we used to ship (randomkit.h)
>> - did not ship distributions.pxd, bounded_integers.pxd,
>> legacy_distributions.pxd or related header files
>>
>> bit_generator.pxd looks fine, common.pxd contains parts that shouldn't be
>> there. I think the intent was to ship at least distributions.pxd/h, and
>> perhaps all of those pxd files.
>>
>> is much more directly addressed by a "don't gratuitously name related C
>>> functions differently than the Python methods they implement" rule (e.g.
>>> "gauss" instead of "normal").
>>>
>>>
>>>> I understand that in some cases a C API can not be evolved in the same
>>>> way as a Python API, but in the example you're giving here I'd say you want
>>>> one function to be public, and one private. Making both public just exposes
>>>> more implementation details for no good reason, and will give us more
>>>> maintenance issues long-term.
>>>>
>>>
>>> Not at all. In this example, neither one of those functions is useful
>>> without the other. If one is public, both must be.
>>>
>>
>> If neither one is useful without the other, it sounds like both should be
>> private and the third one that puts them together - the one that didn't
>> change signature and implements `Generator.foo` - is the public one.
>>
>
> That defeats the point of using the C API in this instance, though. The
> reason it got split into two (in this plausible hypothetical; I'm thinking
> of the binomial implementation here, which caches these intermediates in a
> passed-in struct) is because in C you want to call them in different ways
> (in this case, the prep function once and the sampling function many
> times). It's not that you always call them in lockstep pairs: `prep();
> sample(); prep(); sample();`. A C function that combines them defeats the
> efficiency that one wanted to gain by using the C API. The C API has
> different needs than the Python API, because the Python API has a lot more
> support from the Python language and numpy data structures to be able to
> jam a lot of functionality into a single function signature that C just
> doesn't give us. The purpose of the C API is not just to avoid Python
> function call overhead. If there's a reason that the Generator method needs
> the implementation split up into multiple C functions, that's a really
> strong signal that *other* C code using the C API will need that same
> split. It's not just an implementation detail; it's a documented use case.
>
> Given the prevalence of Cython, it's actually really easy to use the
> Python API pretty easily in "C", so it's actually a huge waste if the C API
> matches the Python API too closely. The power and utility of the C API will
> be in how it *differs* from the Python API. For the distribution methods,
> this is largely in how it lets you sample one number at a time without
> bothering with the numpy and broadcasting overhead. That's the driving
> motivation for having a C API for the distributions, and the algorithms
> that we choose have consequences for the C API that will best satisfy that
> motivation.
>
> The issue that this raises with API stability is that if you require a
> one-to-one match between the C API function and the Generator method, we
> can never change the function signature of the C function. That's going to
> forbid us from moving from an algorithm that doesn't need any
> precomputation to one that does. That precomputation either requires a
> 2-function dance or a new argument to keep the cached values (c.f.
> `random_binomial()`), so it's always going to affect the API. To use such a
> new algorithm, we'll have to add a new function or two to the C API and
> document the deprecation of the older API function. We can't just swap it
> in under the same name, even if the new function is standalone. That's a
> significant constraint on future development when the main issue that led
> to the suggestion is that the names were sometimes gratuitously different
> between the C API and the Python API, which hindered discoverability. We
> can fix *that* problem easily without constraining the universe of
> algorithms that we might consider using in the future.
>

Fair enough, now the use case is clear to me. In summary: there may be real
reasons to deviate and add more functions; let's do so if and when it makes
sense to.

Cheers,
Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20190920/4c639dd2/attachment.html>