[Numpy-discussion] Buffer interface PEP

Zachary Pincus zpincus at stanford.edu
Wed Mar 28 00:35:20 EDT 2007


>> Is this saying that either NULL or a pointer to "B" can be supplied
>> by getbufferproc to indicate to the caller that the array is unsigned
>> bytes? If so, is there a specific reason to put the (minor)
>> complexity of handling this case in the caller's hands, instead of
>> dealing with it internally to getbufferproc? In either case, the
>> wording is a bit unclear, I think.
>>
>
> Yes, the wording could be more clear.   I'm trying to make it easy
> for exporters to change to the new buffer interface.
>
> The main idea I really want to see is that if the caller just passes
> NULL instead of an address, then it means they are assuming the data
> will be "unsigned bytes".  It is up to the exporter to either allow
> this or raise an error.
>
> The exporter should always be explicit if an argument for returning
> the format is provided (I may have thought differently a few days ago).

Understood -- I'm for the exporters being as explicit as possible if  
the argument is provided.
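
To make that convention concrete, here's a rough exporter-side sketch
in C. The type, field names, and function here are made up purely for
illustration -- they aren't the PEP's actual prototypes -- but they
show the "NULL means the caller assumes unsigned bytes" rule:

    #include <Python.h>
    #include <string.h>

    /* Hypothetical exporter type -- invented for this sketch only. */
    typedef struct {
        PyObject_HEAD
        char *data;
        Py_ssize_t len;
        const char *format;          /* e.g. "B", "<f8", ... */
    } MyArrayObject;

    /* `format_out` is the address the caller supplied for the format
       string, or NULL if the caller is assuming unsigned bytes.  The
       exporter either allows that assumption or raises an error; if an
       address *is* given, it always fills it in explicitly. */
    static int
    report_format(MyArrayObject *self, const char **format_out)
    {
        if (format_out == NULL) {
            if (strcmp(self->format, "B") != 0) {
                PyErr_SetString(PyExc_TypeError,
                                "buffer is not unsigned bytes");
                return -1;
            }
            return 0;                /* caller's assumption is valid */
        }
        *format_out = self->format;  /* be explicit when asked */
        return 0;
    }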

>> The general question is that there are several other instances where
>> getbufferproc is allowed to return ambiguous information which must
>> be handled on the client side. For example, C-contiguous data can be
>> indicated either by a NULL strides pointer or a pointer to a
>> properly-constructed strides array.
>
> Here.  I'm trying to be easy on the exporter and the consumer.  If the
> data is contiguous, then neither the exporter nor the consumer will
> likely care about the strides.  Allowing this to be NULL is like the
> current array protocol convention which allows this to be None.

See below. My comments here aren't suggesting that NULL should be
disallowed. I'm basically wondering whether it is a good idea to
allow NULL and something else to represent the same information.
(E.g. as above, an exporter could choose to expose C-contiguous data
to the client either with a NULL strides pointer or with a trivial
strides array.)

Otherwise two different exporters exporting identical data could  
provide different representations, which the clients would need to be  
able to handle. I'm not sure that this is a recipe for perfect  
interoperability.
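
To illustrate what I mean in C (a sketch only -- the argument
conventions are the usual shape/strides ones, not anything taken
verbatim from the PEP), a client that needs C-contiguous data ends up
having to accept both representations:

    #include <Python.h>

    /* If an exporter may signal C-contiguity either with strides ==
       NULL or with an explicitly filled-in (but trivial) strides
       array, a careful client has to handle both. */
    static int
    is_c_contiguous(const Py_ssize_t *shape, const Py_ssize_t *strides,
                    int ndim, Py_ssize_t itemsize)
    {
        Py_ssize_t expected = itemsize;
        int i;

        if (strides == NULL)     /* representation 1: NULL means contiguous */
            return 1;

        /* representation 2: a strides array that happens to describe
           exactly the C-contiguous layout */
        for (i = ndim - 1; i >= 0; i--) {
            if (shape[i] != 1 && strides[i] != expected)
                return 0;
            expected *= shape[i];
        }
        return 1;
    }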

>> Clients that can't handle C-contiguous
>> data (contrived example, I know there is a function to deal with
>> that) would then need to check both for NULL *and* inside the strides
>> array if not NULL, before properly deciding that the data isn't
>> usable to them.
> Not really.  A client that cannot deal with strides will simply not
> pass an address for a strides array to the buffer protocol (that
> argument will be NULL).  If the exporter cannot provide memory without
> stride information, then an error will be raised.

This doesn't really address my question, which I obscured with a
poorly-chosen example. The PEP says (or at least that's how I read
it) that if the client *does* provide an address for the strides
array, then for un-strided arrays, the exporter may either choose to
fill in NULL at that address, or provide a strides array.

Might it be easier for clients if the PEP required that NULL be
returned if the array is C-contiguous? Or at least strongly suggested
that? (I understand that there might be cases where a naive exporter
"thinks" it is dealing with a strided array when it really is
contiguous, and the exporter shouldn't be required to do that
detection.)

The use-case isn't too strong here, but I think it's clear in the  
suboffsets case (see below).

>> Similarly, the suboffsets can be either all negative or
>> NULL to indicate the same thing.
> I think it's much easier to check if suboffsets is NULL rather than
> checking all the entries to see if they are -1 for the very common
> case (i.e. the NumPy case) of no dereferencing.    Also, if you can't
> deal with suboffsets you would just not provide an address for them.

My point exactly! As written, the PEP allows an exporter to either
return NULL, or an array of all negative numbers (in the case that
the client requested that information), forcing a fully-conforming
client to make *both* checks in order to decide what to do.

Especially in this case, it would make sense to require that NULL be
returned when there are no suboffsets. This makes things easier both
for clients that can deal with either case (they can branch on NULL
alone, rather than on NULL-or-all-negative), and for clients that can
*only* deal with suboffsets.
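
In C, the "both checks" a fully-conforming client has to make look
something like this (again just a sketch with made-up names; requiring
NULL would collapse it to a single pointer test):

    #include <Python.h>

    /* "No suboffsets" may currently be reported either as suboffsets
       == NULL or as an array whose entries are all negative. */
    static int
    needs_dereferencing(const Py_ssize_t *suboffsets, int ndim)
    {
        int i;

        if (suboffsets == NULL)           /* check 1 */
            return 0;
        for (i = 0; i < ndim; i++) {      /* check 2 */
            if (suboffsets[i] >= 0)
                return 1;
        }
        return 0;
    }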

Now, in these two cases, the use-case is pretty narrow, I agree.
Basically it makes things easier for savvy clients that can deal with
different data layouts, by not forcing them to make two checks
(strides == NULL or strides array is trivial; suboffsets == NULL or
suboffsets are all negative) when one would do. Again, the PEP allows
the same information to be passed in two very different ways, when it
really doesn't seem like that ambiguity makes life much easier for
exporters.

Maybe I'm wrong about this last point, though. Then there comes the  
trade-off -- should savvy clients bear the complexity of checking two  
different things? (Simple clients needn't check anything -- they just  
pass in NULL.) Or should the complexity be pushed to the savvy  
exporter to do those checks? (Simple exporters just return NULL in  
those variables.) I guess the question comes down to which side of  
the API to make the simplest, given that it appears to me that the  
complexity has to live somewhere.

As a separate suggestion, I think a few sentences in the PEP about  
the protocol design, and what parts are explicitly added to make it  
easy for simple clients and exporters, would be helpful. Something like:
"Clients that cannot deal with strided or suboffset-ed arrays should  
put NULL values in the corresponding getbufferproc call parameters.  
Then exporters will provide data in that format (either because the  
data are already in that format, or because the exporter chooses to  
convert it on behalf of the client), or the exporter will set an  
exception. This simplifies matters greatly for simple clients.  
Likewise, simple exporters which only provide C-contiguous data, or
data with no suboffsets, can simply return NULL if those values are
requested."


>> Might it be more appropriate to specify only one canonical behavior
>> in these cases? Otherwise clients which don't do all the checks on
>> the data might not properly interoperate with providers which format
>> these values in the alternate manner.
>>
> It's important to also be easy to use.  I don't think clients should
> be required to ask for strides and suboffsets if they can't handle them.

Again, that wasn't my suggestion. My suggestion was merely that if
clients ask for that information, it should come in a canonical form,
so that NULL values are meaningful, as opposed to the client
potentially needing to check two different things before deciding
which code-path to embark upon. As above, this might or might not
have the effect of adding extra complexity to the exporters. If not,
good; if so, then it's worth specifically deciding which side of the
API that complexity ought to live on.


>>> 279 Get the buffer and optional information variables about the
>>> buffer.
>>> 280 Return an object-specific view object (which may be simply a
>>> 281 borrowed reference to the object itself).
>>>
>> This phrasing (and similar phrasing elsewhere) is somewhat opaque to
>> me. What's an "object-specific view object"?
>>
> At the moment it's the buffer provider.  It is not defined because it
> could be a different thing for each exporter.   We are still
> discussing this particular point and may drop it.

Fair enough. Definitely worth a clear explanation if it's not dropped,
though.

>>> 333 The struct string-syntax is missing some characters to fully
>>> 334 implement data-format descriptions already available  
>>> elsewhere (in
>>> 335 ctypes and NumPy for example).  Here are the proposed additions:
>>>
>> Is the following table just the additions? If so, it might be good to
>> show the full spec, and flag the specific additions. If not, then the
>> additions should be flagged.
>
> Yes, these are just the additions.  I don't want to reproduce the full
> spec; it is already available elsewhere in the Python docs.

Would be useful to link to the full spec from the PEP, in that case.

>
>>
>>> 341 't'               bit (number before states how many bits)
>>>
>> vs.
>>
>>> 372 According to the struct-module, a number can precede a character
>>> 373 code to specify how many of that type there are.  The
>>>
>> I'm confused -- could this be phrased more clearly? Does '5t' refer
>> to a field 5 bits wide, or 5 one-bit fields? Is 'ttttt' allowed? If
>> so, is it equivalent to or different from '5t'?
>>
> Yes, 'ttttt' is equivalent to '5t', and the difference between one
> field 5 bits wide and 5 one-bit fields is a confusion that only comes
> from thinking in terms of fields at all.   Both of those are
> equivalent.  If you want "fields" then you have to define names.

In that case, line 341 should be clarified. Right now, that line sort
of makes it seem like the struct module should somehow unpack a '5t'
into a single Python object of some type, analogous to how the other
entities are unpacked into single objects (which may themselves be
composite, e.g. the list-of-lists and nested cases).

Lines 372-3 make it clear that, say, '5g' would be unpacked into five
Python floats, but the fact that the 'bit' definition in line 341
explicitly mentions the number before, while no other definitions do
so, almost makes it appear that '5t' is supposed to be treated as a
single atomic object in the same way that 'g' alone would be. Since
this isn't the case, I would suggest dropping the parenthetical in
line 341 as redundant and potentially misleading.
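
Just to pin down the counting rule with a toy example (this is only an
illustration of the repeat-count expansion, not a parser for the real
struct syntax, and it treats the proposed 't' code like any other
character):

    #include <ctype.h>
    #include <stdlib.h>
    #include <string.h>

    /* expand_counts("5t")   -> "ttttt"
       expand_counts("3g2B") -> "gggBB" */
    static char *
    expand_counts(const char *format)
    {
        size_t cap = strlen(format) * 16 + 1;  /* rough bound, toy code */
        char *out = malloc(cap);
        const char *p = format;
        size_t n = 0;
        long i, count;

        if (out == NULL)
            return NULL;
        while (*p) {
            count = 1;
            if (isdigit((unsigned char)*p)) {
                char *end;
                count = strtol(p, &end, 10);
                p = end;
            }
            if (*p == '\0')
                break;                         /* trailing count: ignore */
            for (i = 0; i < count && n + 1 < cap; i++)
                out[n++] = *p;
            p++;
        }
        out[n] = '\0';
        return out;
    }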

>> In general, the logic of the 'locking mechanism' should be described
>> at a high level at some point. It's described in nitty-gritty
>> detail, but at least I would have appreciated a bit more of a
>> discussion about the general how and why -- this would be helpful to
>> clients trying to use the locking mechanism properly.
>>
>
> The point of locking is so that the exporter knows when it can
> reallocate its buffer.  Right now, reference counting is the only
> way to do that.  But reference counting is not specific enough.
> Perhaps the reference is because of an object that is using the same
> memory, but perhaps the reference is just another name pointing to
> exactly the same object.
>
> In the case of NumPy, NumPy needs to know when the resize method can
> be safely applied.   Currently, it is ambiguous and unclear when a
> NumPy array can re-allocate its own buffer.  Also, in the past,
> exposing the array object's memory to Python and then later
> re-allocating it led to problems.
>
> I'll try and address this more clearly.

That makes sense. What I think is needed is a high-ish level
introduction to the "moving parts" of locking -- essentially covering
what a simple client or exporter needs to know in order to use the
interface. I felt that the discussion was a bit too low-level,
leaving folks in danger of missing the forest for the trees, as it
were.
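
(To make the refcounting ambiguity above concrete: both of the callers
in the sketch below bump the refcount of `array` by one, and the
exporter can't tell them apart by looking at the count, even though
only the second one actually holds a pointer into the array's memory.
The functions are invented for illustration; only the Py_INCREF and
Py_DECREF calls are real API.)

    #include <Python.h>

    static void
    just_another_name(PyObject *array)
    {
        PyObject *alias = array;   /* another name, no access to the memory */
        Py_INCREF(alias);
        /* ... */
        Py_DECREF(alias);
    }

    static void
    really_uses_the_memory(PyObject *array, const char *data)
    {
        Py_INCREF(array);          /* keeps `data` -- a pointer into the
                                      array's buffer -- valid */
        /* ... read through data while the reference is held ... */
        (void)data;
        Py_DECREF(array);
    }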

Zach


