[Numpy-discussion] Draft PEP for the new buffer interface to be in Python 3000
Travis Oliphant
oliphant at ee.byu.edu
Tue Feb 27 20:06:17 EST 2007
PEP: <unassigned>
Title: Revising the buffer protocol
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant <oliphant at ee.byu.edu>
Status: Draft
Type: Standards Track
Created: 28-Aug-2006
Python-Version: 3000
Abstract
This PEP proposes re-designing the buffer API (PyBufferProcs
function pointers) to improve the way Python allows memory sharing
in Python 3.0.
In particular, it is proposed that the multiple-segment and
character-buffer portions of the buffer API be eliminated and
that additional function pointers be provided to allow sharing any
multi-dimensional nature of the memory and the data format the
memory contains.
Rationale
The buffer protocol allows different Python types to exchange a
pointer to a sequence of internal buffers. This functionality is
'''extremely''' useful for sharing large segments of memory between
different high-level objects, but it is too limited and has several issues:
1. There is the little-used (never used?) "sequence-of-segments" option
(bf_getsegcount)
2. There is the apparently redundant character-buffer option
(bf_getcharbuffer)
3. There is no way for a consumer to tell the buffer-API-exporting
object it is "finished" with its view of the memory and
therefore no way for the exporting object to be sure that it is
safe to reallocate the pointer to the memory that it owns (the
array object reallocating its memory after sharing it with the
buffer object which held the original pointer led to the
infamous buffer-object problem).
4. Memory is just a pointer with a length. There is no way to
describe what's "in" the memory (float, int, C-structure, etc.)
5. There is no shape information provided for the memory. But,
several array-like Python types could make use of a standard
way to describe the shape-interpretation of the memory
(wxPython, GTK, PyQt, CVXOPT, PyVox, Audio and Video
Libraries, ctypes, NumPy, database interfaces, etc.)
There are two widely used libraries that use the concept of
discontiguous memory: PIL and NumPy. Their view of discontiguous
arrays is a bit different, though. NumPy uses the notion of
constant striding in each dimension as its basic concept of an
array. In this way a simple sub-region of a larger array can be
described without copying the data. Strided memory is a common
way to describe data to many computing libraries (such as the BLAS
and LAPACK).
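As a rough illustration of constant striding (not part of the proposal itself), the sketch below locates elements of a sub-region inside a larger array using only a start offset, a shape, and per-dimension byte strides. The helper name strided_offset and the 3x4 example array are assumptions made for this example.

```python
import struct

def strided_offset(index, strides, start=0):
    """Byte offset of the element at `index`, given per-dimension byte strides."""
    return start + sum(i * s for i, s in zip(index, strides))

# A 3x4 array of 4-byte little-endian ints, stored C-contiguously.
rows, cols, itemsize = 3, 4, 4
data = struct.pack("<12i", *range(12))
strides = (cols * itemsize, itemsize)   # (16, 4): bytes per row step, per column step

# The sub-region [1:3, 1:3] reuses the same buffer and the same strides;
# only the start offset and the shape differ, so nothing is copied.
sub_start = strided_offset((1, 1), strides)
sub_shape = (2, 2)
sub = [
    [struct.unpack_from("<i", data, strided_offset((r, c), strides, sub_start))[0]
     for c in range(sub_shape[1])]
    for r in range(sub_shape[0])
]
# sub is [[5, 6], [9, 10]]
```

The same (start, shape, strides) triple is the kind of description that strided numerical libraries accept.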
The PIL uses a more opaque memory representation. Sometimes an
image is contained in a contiguous segment of memory, but
sometimes it is contained in an array of pointers to the
contiguous segments (usually lines) of the image. This allows the
image to not be loaded entirely into memory. The PIL is where the
idea of multiple buffer segments in the original buffer interface
came from, I believe.
The buffer interface should allow discontiguous memory areas to
share standard striding information. However, consumers that do
not want to deal with strided memory should also be able to
request a contiguous segment easily.
Proposal Overview
* Eliminate the char-buffer and multiple-segment sections of the
buffer-protocol.
* Unify the read/write versions of getting the buffer.
* Add a new function to the protocol that should be called when
the consumer object is "done" with the view.
* Add a new function to allow the protocol to describe what is in
memory (unifying what is currently done in struct and
array)
* Add a new function to allow the protocol to share shape
information
* Fix all objects in core and standard library to conform to the
new interface
* Extend the struct module to handle more format specifiers
Specification
Change the PyBufferProcs structure to
typedef struct {
    getbufferproc bf_getbuffer;
    releasebufferproc bf_releasebuffer;
    formatbufferproc bf_getbufferformat;
    shapebufferproc bf_getbuffershape;
} PyBufferProcs;
typedef PyObject *(*getbufferproc)(PyObject *obj, void **buf,
Py_ssize_t *len, int requires);
Return a pointer to memory in buf and the length of that memory
buffer in len. Requirements for the memory are provided in
requires (PYBUFFER_WRITE, PYBUFFER_ONESEGMENT). NULL is
returned and an error raised if the object cannot return a view
with those requirements. Otherwise, an object-specific "view"
object is returned (which can just be a borrowed reference to
obj).
This view object should be used in the other API calls and
does not need to be decref'd. It should be "released" if the
interface exporter provides the bf_releasebuffer function.
typedef int (*releasebufferproc)(PyObject *view);
This function is called when a view of memory previously
acquired from the object is no longer needed. It is up to the
exporter of the API to make sure all views have been released
before eliminating a reference to a previously returned pointer.
It is up to consumers of the API to call this function on the
object whose view is obtained when it is no longer needed. A -1
is returned on error and 0 on success.
typedef char *(*formatbufferproc)(PyObject *view, int *itemsize);
Get the format-string of the memory using the struct-module
string syntax (see below for proposed additions to that syntax).
Also, there is never an alignment assumption in this
string---the full byte-layout is always required. If the
implied size of this string is smaller than the length of the
buffer then it is assumed that the string is repeated.
If itemsize is not NULL, then store in *itemsize the size implied by
the format string. This could be the entire length of the buffer or
just the length of each element. It is equivalent to *itemsize
= PyObject_SizeFromFormat(ret) if ret is the returned string.
However, very often objects already know the itemsize without
having to compute it separately.
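To make the repetition rule concrete, here is a minimal sketch that uses the existing struct.calcsize as a stand-in for the proposed PyObject_SizeFromFormat. The function name item_count is hypothetical, and the "=" prefix is used so that no alignment padding is assumed.

```python
import struct

def item_count(fmt, buffer_length):
    """Return (itemsize, repeats) for a format that repeats across a buffer."""
    itemsize = struct.calcsize("=" + fmt)   # "=" means standard sizes, no padding
    if buffer_length % itemsize:
        raise ValueError("buffer length is not a multiple of the item size")
    return itemsize, buffer_length // itemsize

# An int followed by a double is 4 + 8 = 12 bytes per item,
# so a 48-byte buffer holds four repetitions of the format.
itemsize, repeats = item_count("id", 48)
```

An exporter that already knows its item size could skip this computation, which is the reason for the optional itemsize argument.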
typedef PyObject *(*shapebufferproc)(PyObject *view);
Return a 2-tuple of lists containing shape information: (shape,
strides). The strides object can be None if the memory is
C-style contiguous; otherwise it provides the striding in each
dimension.
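A consumer that receives None for the strides could fill in the implied C-style values itself. The following is only a sketch under that assumption; the helper names c_contiguous_strides and effective_strides are invented for the example.

```python
def c_contiguous_strides(shape, itemsize):
    """Byte strides of a C-contiguous array (last dimension varies fastest)."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return strides[::-1]

def effective_strides(shape, strides, itemsize):
    """Treat strides=None as shorthand for C-contiguous strides."""
    if strides is None:
        return c_contiguous_strides(shape, itemsize)
    return list(strides)

# A 2x3x4 array of 8-byte items: stepping along the first dimension
# skips 3*4*8 bytes, the second 4*8 bytes, and the last 8 bytes.
strides = effective_strides([2, 3, 4], None, 8)
```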
All of these routines are optional for a type object (but the last
three make no sense unless the first one is implemented).
New C-API calls are proposed
int
PyObject_CheckBuffer(PyObject *obj)
Return 1 if the getbuffer function is available, otherwise 0.
PyObject *
PyObject_GetBuffer(PyObject *obj, void **buf, Py_ssize_t *len,
int requires)
Return a borrowed reference to a "view" object of memory for the
object. Requirements for the memory should be given in requires
(PYBUFFER_WRITE, PYBUFFER_ONESEGMENT). The memory pointer is in
*buf and its length in *len.
Note, the memory is not considered a single segment of memory
unless PYBUFFER_ONESEGMENT is used in requires. Get possible
striding using PyObject_GetBufferShape on the view object.
int
PyObject_ReleaseBuffer(PyObject *view)
Call this function to tell obj that you are done with your "view".
This is a no-op if the object doesn't implement a release function.
Only call this after a previous PyObject_GetBuffer has succeeded.
Return -1 on error.
char *
PyObject_GetBufferFormat(PyObject *view, int *itemsize)
Return a NULL-terminated string indicating the data-format of
the memory buffer. The string is in struct-module syntax with
the exception that there is never an alignment assumption (all
bytes must be accounted for). If the length of the buffer
indicated by this string is smaller than the total length of the
buffer, then a repeat of the string is implied to fill the
length of the buffer.
If itemsize is not NULL, then store in *itemsize the implied size
of each item (this could be calculated from the format string
but it is often known by the view object anyway).
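The "no alignment assumption" rule can be illustrated with the existing struct module, where "@" selects native alignment and "=" selects standard sizes with no padding. This is only a sketch of the idea; the exact native size depends on the platform, so it is not stated here.

```python
import struct

# Native mode may insert padding after the 1-byte field so the int is aligned.
native = struct.calcsize("@bi")

# Standard mode never pads: 1 byte + 4 bytes.
packed = struct.calcsize("=bi")

# If an exporter's real layout does contain padding, the format string has to
# account for it explicitly, for example with pad bytes ('x').
explicit = struct.calcsize("=b3xi")
```

Under the proposal, a format string would always behave like the "=" variants: every byte of the item is described by the string itself.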
PyObject *
PyObject_GetBufferShape(PyObject *view)
Return a 2-tuple of lists (shape, strides) providing the
multi-dimensional shape of the memory area. Each stride
gives how many bytes to skip to move one step along the
corresponding dimension, measured from the start of the array.
Memory that is not a single contiguous-buffer can be represented
with the pointer returned from GetBuffer and the shape and
strides returned from GetBufferShape.
int PyObject_SizeFromFormat(char *)
Return the implied size of the data-format area from a struct-style
description.
Additions to the struct string-syntax
The struct string-syntax is missing some characters to fully
implement data-format descriptions already available elsewhere (in
ctypes and NumPy for example). Here are the proposed additions:
Character          Description
==================================================================
'1'                bit (number before states how many bits)
'?'                platform _Bool type
'g'                long double
'F'                complex float
'D'                complex double
'G'                complex long double
'c'                ucs-1 (latin-1) encoding
'u'                ucs-2
'w'                ucs-4
'O'                pointer to Python Object
'T{}'              structure (detailed layout inside {})
'(k1,k2,...,kn)'   multi-dimensional array of whatever follows
':name:'           optional name of the preceding element
'&'                specific pointer (prefix before another character)
'X{}'              pointer to a function (optional function
                   signature inside {})
The struct module will be changed to understand these as well and
return appropriate Python objects on unpacking. Unpacking a
long double will return a ctypes long_double. Unpacking 'u' or
'w' will return Python unicode. Unpacking a multi-dimensional
array will return a list of lists. Unpacking a pointer will
return a ctypes pointer object. Unpacking a bit will return a
Python Bool.
Endian-specification ('=','>','<') is also allowed inside the
string so that it can change if needed. The previously-specified
endian string is enforced at all times. The default endian is '='.
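The current struct module only accepts a byte-order character at the start of a format string. The sketch below emulates the proposed behaviour by splitting the string into runs, each governed by the most recently seen byte-order character. The function name unpack_mixed is hypothetical, and only the existing struct codes are handled.

```python
import re
import struct

def unpack_mixed(fmt, data):
    """Unpack `fmt`, allowing '=', '<' or '>' to appear anywhere in it."""
    values, offset = [], 0
    # Split into (byte order, codes) runs; a run with no marker uses '='.
    for endian, codes in re.findall(r"([=<>]?)([^=<>]+)", fmt):
        part = (endian or "=") + codes
        values.extend(struct.unpack_from(part, data, offset))
        offset += struct.calcsize(part)
    return tuple(values)

# Two 16-bit fields holding the value 1: little-endian first, then big-endian.
data = struct.pack("<H", 1) + struct.pack(">H", 1)
mixed = unpack_mixed("<H>H", data)    # byte order switches mid-string
same = unpack_mixed("<HH", data)      # '<' stays in force for both fields
```

Reading both fields as little-endian yields 256 for the second one, which shows why the byte order needs to be switchable inside a single description.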
According to the struct-module, a number can precede a character
code to specify how many of that type there are. The
(k1,k2,...,kn) extension also allows specifying if the data is
supposed to be viewed as a (C-style contiguous, last-dimension
varies the fastest) multi-dimensional array of a particular format.
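As an illustration of how a (k1,k2,...,kn) array could unpack to a list of lists, the sketch below reads the flat items with the existing struct module and then folds them by dimension. The function unpack_array and its arguments are assumptions for this example; it does not parse the proposed syntax itself.

```python
import struct

def unpack_array(shape, code, data, offset=0):
    """Unpack a C-contiguous array of `code` items into nested lists."""
    count = 1
    for extent in shape:
        count *= extent
    flat = list(struct.unpack_from("=%d%s" % (count, code), data, offset))
    # Fold the flat list from the last dimension inward, since the last
    # dimension varies fastest in C-style layout.
    for extent in reversed(shape[1:]):
        flat = [flat[i:i + extent] for i in range(0, len(flat), extent)]
    return flat

# Six 16-bit ints viewed as the proposed '(2,3)h'.
data = struct.pack("=6h", *range(6))
nested = unpack_array((2, 3), "h", data)
# nested is [[0, 1, 2], [3, 4, 5]]
```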
Functions should be added to ctypes to create a ctypes object from
a struct description, and long-double and ucs-2 types should be
added to ctypes.
Code to be affected
All objects and modules in Python that export or consume the old
buffer interface will be modified. Here is a partial list.
* buffer object
* bytes object
* string object
* array module
* struct module
* mmap module
* ctypes module
* anything else using the buffer API
Issues and Details
The proposed locking mechanism relies entirely on the objects
implementing the buffer interface to do their own thing. Ideally
an object that implements the buffer interface should keep at least
a number indicating how many views are extant (acquired but not yet
released).
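One way an exporter might do its own bookkeeping is sketched below in Python rather than C. The class CountingExporter and its method names are hypothetical stand-ins for the bf_getbuffer and bf_releasebuffer slots; the proposal does not prescribe this design.

```python
class CountingExporter:
    """Toy exporter that refuses to reallocate while views are outstanding."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._views = 0            # views handed out and not yet released

    def get_buffer(self):
        self._views += 1
        return self._data

    def release_buffer(self):
        if self._views == 0:
            return -1              # nothing to release: report an error
        self._views -= 1
        return 0

    def resize(self, size):
        if self._views:
            raise BufferError("cannot resize while views are outstanding")
        new = bytearray(size)
        keep = min(size, len(self._data))
        new[:keep] = self._data[:keep]
        self._data = new           # safe: no consumer holds the old memory

    def __len__(self):
        return len(self._data)
```

A consumer would pair every successful get_buffer call with exactly one release_buffer call, after which the exporter is free to reallocate.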
The handling of discontiguous memory is new and can be seen as a
modification of the multiple-segment interface. It is motivated by
NumPy (used to be Numeric). NumPy objects should be able to share
their strided memory with code that understands how to manage
strided memory.
Code should also be able to request contiguous memory if needed and
objects exporting the buffer interface should be able to handle
that either by raising an error or by constructing a read-only
contiguous object and returning that as the view.
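The second option could look roughly like the following sketch, which gathers a two-dimensional strided region into an immutable, C-contiguous bytes object. The function contiguous_copy and its parameters are assumptions made for illustration, and it handles only the 2-D case.

```python
import struct

def contiguous_copy(data, shape, strides, code, start=0):
    """Gather a 2-D strided region into read-only, C-contiguous bytes."""
    size = struct.calcsize("=" + code)
    out = bytearray()
    for r in range(shape[0]):
        for c in range(shape[1]):
            off = start + r * strides[0] + c * strides[1]
            out += data[off:off + size]
    return bytes(out)              # bytes is immutable, so the copy is read-only

# A 2x3 array of 16-bit ints; its byte strides are (6, 2).
data = struct.pack("=6h", *range(6))

# Column 1 of the array is not contiguous in the original buffer.
column = contiguous_copy(data, (2, 1), (6, 2), "h", start=2)

# Swapping shape and strides gives the transpose, again as contiguous bytes.
transposed = contiguous_copy(data, (3, 2), (2, 6), "h")
```

Because the result is a copy, writes by the consumer could not reach the exporter's memory, which is why such a view would have to be read-only.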
Currently the struct module does not allow specification of nested
structures. It seems that specifying a nested structure should be
supported, as several ways of viewing memory areas (ctypes and
NumPy) already allow this.
Copyright
This PEP is placed in the public domain.