[Python-3000] pre-PEP: Enhancing buffer protocol (tp_as_buffer)

Mon Feb 26 20:24:39 CET 2007

Guido van Rossum wrote:
> On 2/25/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> 
>>Travis Oliphant wrote:
>>
>>
>>>   2. There is no way for a consumer to tell the protocol-exporting
>>>object it is "finished" with its view of the memory and therefore no way
>>>for the object to be sure that it can reallocate the pointer to the
>>>memory that it owns (the array object reallocating its memory after
>>>sharing it with the buffer object led to the infamous buffer-object
>>>problem).
>>
>
> Another problem that would be solved by this is the current unsafety
> of blocking I/O operations like file.readinto() and
> socket.recv_into(). These operations do roughly the following:
> 
> (1) get the pointer and length from the buffer API
> (2) release the GIL
> (3) call the blocking read() or recv() system call with the pointer and length
> (4) reacquire the GIL
> 
> The problem is that while the GIL is released, another thread with
> access to the object whose buffer is being read into, could modify it
> causing the buffer to be moved in memory, and the read() or recv()
> operation will be overwriting freed memory (or worse, memory allocated
> for a different purpose).
> 
> I realized this thinking about the 3.0 bytes object, but the 2.x array
> object has the same problems, and probably every other object that
> uses the buffer API and has a mutable size (if there are any).

Yes, the NumPy object has this problem as well (although it has *very* 
conservative checks so that if the reference count on the array is not 
1, memory is not reallocated).

> 
> I agree that getting the pointer and length should be separated from
> finding out how the bytes should be interpreted. I'd like to propose a
> simple stack or hierarchy of classes to address (what I think are)
> Travis's needs:
> 
> - At the bottom is a redesigned buffer API: add locking, remove
> segcount and char buffers.

Great.  I have no problem with this.  Is your idea of locking the same 
as mine (i.e., a function in the API for release)?
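
To make the question concrete, here is a minimal Python sketch of what "locking with an explicit release" could look like from the consumer's side. It relies on the memoryview and bytearray types as they behave in later Python 3 releases, purely as an illustration; it is not the API being proposed in this thread, and try_grow is a hypothetical helper.

```python
def try_grow(buf, extra):
    """Try to append to buf; report whether the resize was allowed."""
    try:
        buf.extend(extra)
        return True
    except BufferError:
        # The exporter refuses to move its memory while a view is outstanding.
        return False


buf = bytearray(b"hello")
view = memoryview(buf)      # consumer acquires a view; the buffer is "locked"
grew_while_locked = try_grow(buf, b" world")

view.release()              # consumer signals it is finished with the memory
grew_after_release = try_grow(buf, b" world")
```

The point of the sketch is only that the exporter can safely reallocate once every consumer has released its view, which is the guarantee a blocking readinto() or recv_into() would need while the GIL is dropped.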

> 
> - There is a mixin class (at least conceptually it's a mixin) which
> takes anything implementing the redesigned buffer API and adds the
> bytes API (see recently updated PEP 358); operations like .strip() or
> slicing should return copies (of the same or a different type) or
> views at the discretion of the underlying object. (Maybe there should
> be a read-only and read-write version of this; note that read-only is
> not the same as immutable, since the underlying buffer may be modified
> by other APIs, if it allows this.)

I'm not sure what this mixin class is.  Is this a base class for the 
bytes object?   I need to understand this better in order to write a PEP.

> 
> - *Another* API built on top of the redesigned buffer API would be
> something more aligned with numpy's needs, adding (a) a shape
> descriptor indicating the size, offset and stride of each dimension,
> and (b) a record descriptor indicating the interpretation of one
> element of the array. For (a), a list of 3-tuples of ints would
> probably be sufficient (constrained so that no valid combination of
> indexes points outside the buffer); for (b), I propose (with Jim
> Hugunin who first suggested this at PyCon) to use the same concise but
> expressive format-string-like notation used by the struct module. (The
> bytes API is not quite a special case of this, since it provides more
> string-like operations.)
> 

Great.  NumPy has already adopted the struct standard for its "hidden" 
character codes.

We also need to add some format codes for complex data ('F', 'D', 'G') and 
for long doubles ('g').  I would also propose that we define an 
enumeration in Python so we can refer to these codes in C/C++ as constants:

PYFORMAT_LONG
PYFORMAT_UINT

etc.
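
As a rough illustration of the enumeration idea, the sketch below models two of the constants in Python and ties each one to an existing struct code. The class name PyFormat, the numeric values, the STRUCT_CODE table and the itemsize helper are all assumptions made for this example; only the names LONG and UINT come from the list above.

```python
from enum import IntEnum
import struct


class PyFormat(IntEnum):
    # Numeric values are arbitrary placeholders, not a proposed assignment.
    LONG = 0
    UINT = 1


# Hypothetical mapping from each constant to its struct-module character code.
STRUCT_CODE = {PyFormat.LONG: "l", PyFormat.UINT: "I"}


def itemsize(fmt):
    """Size in bytes of one element, using struct's standard ('=') sizes."""
    return struct.calcsize("=" + STRUCT_CODE[fmt])
```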

a) I would prefer a 3-tuple of lists for the shape descriptor
(shape list, stride list, offset list)

That way default striding could be given as None, and the offset list 
could be left out (given as None) as well.

My view on the offset is that it is not necessary as the start of the 
array is already given by the memory pointer.  But, if others see a 
strong need for it, I have no problem with including it.
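
A small Python sketch of how a consumer might interpret such a 3-tuple follows. It assumes that strides and offsets are measured in bytes, that None strides mean C-contiguous (row-major) layout, and that None offsets mean zero; the function names are hypothetical and nothing here is a settled part of the proposal.

```python
def default_strides(shape, itemsize):
    """C-contiguous (row-major) strides in bytes for the given shape."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


def byte_offset(index, descriptor, itemsize):
    """Byte offset of element `index` relative to the buffer pointer."""
    shape, strides, offsets = descriptor
    if strides is None:
        strides = default_strides(shape, itemsize)
    if offsets is None:
        offsets = [0] * len(shape)
    if len(index) != len(shape):
        raise IndexError("wrong number of indices")
    total = 0
    for i, n, stride, off in zip(index, shape, strides, offsets):
        if not 0 <= i < n:
            raise IndexError("index out of range")
        total += off + i * stride
    return total


# A 3x4 array of 8-byte doubles with default striding and no offsets.
descriptor = ([3, 4], None, None)
```

With this descriptor, element (1, 2) sits one full row plus two items past the pointer, so the consumer never needs a separate offset in the common contiguous case.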

b) I'm also fine with just returning a string for the record descriptor 
like the struct module uses.
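
For instance, a struct-style string is already enough to describe and decode a simple record. The sketch below uses only the existing struct module; the record layout (a 32-bit int followed by a 64-bit float, little-endian) is made up for illustration.

```python
import struct

# Hypothetical record descriptor: little-endian int32 followed by float64.
record_format = "<id"
record_size = struct.calcsize(record_format)

# Two records packed back to back, as they might sit in an exported buffer.
raw = struct.pack(record_format, 1, 1.5) + struct.pack(record_format, 2, 2.5)

# A consumer that knows only the format string can recover every element.
records = list(struct.iter_unpack(record_format, raw))
```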

-Travis

> The crucial idea here (like so often :-) is not to use inheritance but
> composition. This means that we can separate management of the buffer
> (e.g. malloc, mmap, whatever) from providing APIs on top of this
> (either the bytes API or the multi-dimensional array API).
>