[Python-Dev] The buffer interface

Jeff Collins jcollins@endtech.com
Mon, 16 Oct 2000 13:22:22 -0700 (PDT)


All this just when I was getting accustomed to the thought of using buffer
objects in the Palm Python port...

I need buffer objects for many of the same reasons Greg Stein originally
proposed, as you quoted below.  On the Palm, the data manager heap (used
for permanent database storage and limited by the physical memory size)
already stores the compiled Python module.  Directly referencing the data
of objects like bytecodes and strings would greatly reduce the dynamic
heap requirements (the dynamic heap is currently limited to 256K under
PalmOS 3.5, even on devices with 4M of RAM or more).

Buffer objects seem like a natural choice.  A record in a Palm database is
just a chunk of contiguous memory.  Representing that chunk as a buffer
object would allow directly referencing it, and any of its slices, without
a copy.  So the co_code of code objects could be unmarshalled as a
reference to permanent storage.  Further, with the appropriate
modifications, string objects (char *ob_sval?) could reference this memory
as well, though that additional optimization is probably only appropriate
for small platforms.
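
At the C level I am picturing something like the following -- a rough
sketch only; the record pointer and size stand for whatever the data
manager hands back once the record is locked, and error handling is
omitted:

    /* Sketch: expose a locked database record to Python without copying.
     * The record memory must stay valid (locked) for as long as the
     * returned buffer object is alive; Python will not copy or free it. */
    #include "Python.h"

    static PyObject *
    record_as_buffer(void *rec_ptr, unsigned long rec_size)
    {
        /* Read-only view directly onto storage heap memory. */
        return PyBuffer_FromMemory(rec_ptr, (int)rec_size);
    }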

I think that buffer objects are fairly important.  They provide a
mechanism for exposing arbitrary chunks of memory (e.g.,
PyBuffer_FromMemory), something that no other Python object does, AFAIK.
Perhaps clarifying the interface (such as having the slice operator
return a buffer, as suggested below) and providing more hooks from Python
for creating buffers (via newmodule, say) would be helpful.
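
For the Python-level hook, even a trivial extension module would do;
something along these lines, where the "newbuf" name and its single
function are just placeholders:

    /* Sketch of a module-level hook for creating buffers from Python:
     * newbuf.new(size) returns a writable buffer object that owns its
     * own memory (via PyBuffer_New). */
    #include "Python.h"

    static PyObject *
    newbuf_new(PyObject *self, PyObject *args)
    {
        int size;

        if (!PyArg_ParseTuple(args, "i", &size))
            return NULL;
        return PyBuffer_New(size);
    }

    static PyMethodDef newbuf_methods[] = {
        {"new", newbuf_new, METH_VARARGS},
        {NULL, NULL}
    };

    void
    initnewbuf(void)
    {
        Py_InitModule("newbuf", newbuf_methods);
    }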



On Mon, 16 Oct 2000, Guido van Rossum wrote:

> The buffer interface is one of the most misunderstood parts of
> Python.  I believe that if it were PEPped today, it would have a hard
> time getting accepted in its current form.
> 
> There are also two different parts that are commonly referred to by this
> name: the "buffer API", which is a C-only API, and the "buffer
> object", which has both a C API and a Python API.
> 
> Both were largely proposed, implemented and extended by others, and I
> have to admit that I'm still uneasy with defending them, especially
> the buffer object.  Both are extremely implementation-dependent (in
> JPython, neither makes much sense).
> 
> The Buffer API
> --------------
> 
> The C-only buffer API was originally intended to allow efficient
> binary I/O from and (in some cases) to large objects that have a
> relatively well-understood underlying memory representation.  Examples
> of such objects include strings, array module arrays, memory-mapped
> files, NumPy arrays, and PIL objects.
> 
> It was created with the desire to avoid an expensive memory-copy
> operation when reading or writing large arrays.  For example, if you
> have an array object containing several millions of double precision
> floating point numbers, and you want to dump it to a file, you might
> prefer to do the I/O directly from the array's memory buffer rather
> than first copying it to a string.  (You lose portability of the data,
> but that's often not a problem the user cares about in these cases.)
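> 
> At the C level, the path we wanted to avoid looks roughly like this (a
> sketch only; for several million doubles the intermediate string is
> itself tens of megabytes):
> 
>     #include <stdio.h>
>     #include "Python.h"
> 
>     /* The copying path: tostring() materializes a second, byte-for-byte
>      * copy of the array's data on the heap before any of it reaches the
>      * file. */
>     static int
>     dump_by_copy(PyObject *array_obj, FILE *fp)
>     {
>         PyObject *s = PyObject_CallMethod(array_obj, "tostring", NULL);
> 
>         if (s == NULL)
>             return -1;
>         fwrite(PyString_AsString(s), 1, PyString_Size(s), fp);
>         Py_DECREF(s);
>         return 0;
>     }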
> 
> An alternative solution for this particular problem was considered:
> object types in need of this kind of efficient I/O could define their
> own I/O methods, thereby allowing them to hide their internal
> representation.  This was implemented in some cases (e.g. the array
> module has read() and write() methods) but rejected, because a
> simple-minded implementation of this approach would not work with
> "file-like" objects (e.g. StringIO files).  It was deemed important
> that file-like objects would not place restrictions on the kind of
> objects that could interact with them (compared to real file objects).
> 
> A possible solution would have been to require that each object
> implementing its own read and write methods should support both
> efficient I/O to/from "real" file objects and fall-back I/O to/from
> "file-like" objects.  The fall-back I/O would have to convert the
> object's data to a string object which would then be passed to the
> write() method of the file-like object.  This approach was rejected
> because it would make it impossible to implement an alternative file
> object that would be as efficient as the real file object, since large
> object I/O would be using the inefficient fallback interface.
> 
> To address these issues, we decided to define an interface that would
> let I/O operations ask the objects where their data bytes are in
> memory, so that the I/O can go directly to/from the memory allocated
> by the object.  This is the classic buffer API.  It has a read-only
> and a writable variant -- the writable variant is for mutable objects
> that will allow I/O directly into them.  Because we expected that some
> objects might have an internal representation distributed over a
> (small) number of separately allocated pieces of memory, we also added
> the getsegcount() API.  All objects that I know support the buffer API
> return a segment count of 1, and most places that use the buffer API
> give up if the segment count is larger; so this may be considered as
> an unnecessary generalization (and source of complexity).
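> 
> In rough outline, a typical consumer of the read-only side looks like
> this (a sketch, with error reporting omitted); note the check that
> gives up unless there is exactly one segment:
> 
>     #include <stdio.h>
>     #include "Python.h"
> 
>     /* Write an object's bytes straight from its own memory via the
>      * buffer API.  Fails if the object has no buffer interface or has
>      * more than one segment. */
>     static int
>     write_via_buffer(PyObject *obj, FILE *fp)
>     {
>         PyBufferProcs *pb = obj->ob_type->tp_as_buffer;
>         void *ptr;
>         int len;
> 
>         if (pb == NULL || pb->bf_getreadbuffer == NULL ||
>             pb->bf_getsegcount == NULL)
>             return -1;                    /* no buffer API */
>         if ((*pb->bf_getsegcount)(obj, NULL) != 1)
>             return -1;                    /* multiple segments: give up */
>         len = (*pb->bf_getreadbuffer)(obj, 0, &ptr);
>         if (len < 0)
>             return -1;
>         fwrite(ptr, 1, (size_t)len, fp);  /* I/O directly from the object */
>         return 0;
>     }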
> 
> The buffer API has found significant use in a way that wasn't
> originally intended: as a sort of informal common base class for
> string-like objects in situations where a char[] or char* type must be
> passed (in a read-only fashion) to C code.  This is in fact the most
> common use of the buffer API now, and appears to be the reason why the
> segment count must typically be 1.
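> 
> For example, an extension function written like this (a sketch; the
> checksum function itself is made up) accepts real strings and any other
> single-segment read-buffer object -- buffer objects, arrays, mmap'd
> files -- without knowing anything about their types:
> 
>     #include "Python.h"
> 
>     /* "s#" hands back a pointer into the object's own memory for any
>      * object supporting the single-segment read buffer interface, not
>      * just genuine string objects. */
>     static PyObject *
>     checksum(PyObject *self, PyObject *args)
>     {
>         char *data;
>         int len, i;
>         unsigned long sum = 0;
> 
>         if (!PyArg_ParseTuple(args, "s#", &data, &len))
>             return NULL;
>         for (i = 0; i < len; i++)
>             sum += (unsigned char)data[i];
>         return PyInt_FromLong((long)sum);
>     }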
> 
> In connection with this, the buffer API has grown a distinction
> between character and binary buffers (on the read-only end only).
> This may have been a mistake; it was intended to help with Unicode but
> it ended up not being used.
> 
> The Buffer Object
> -----------------
> 
> The buffer object has a much less clear reason for its existence.
> When Greg Stein first proposed it, he wrote:
> 
>     The intent of this type is to expose a string-like interface from
>     an object that supports the buffer interface (without making a
>     copy). In addition, it is intended to support slices of the target
>     object.
> 
>     My eventual goal here is to tweak the file object to support
>     memory mapping and the buffer interface. The buffer object can
>     then return slices of the file without making a new copy. Next
>     step: change marshal.c, ceval.c, and compile.c to support a buffer
>     for the co_code attribute. Net result is that copies of code
>     streams don't need to be copied onto the heap, but can be left in
>     an mmap'd file or a frozen file. I'm hoping there will be some
>     perf gains (time and memory).
> 
>     Even without some of the co_code work, enabling mmap'd files and
>     buffers onto them should be very useful. I can probably rattle off
>     a good number of other uses for the buffer type.
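> 
> The C-level constructor that supports this intent is
> PyBuffer_FromObject(); roughly, as a sketch (the helper name is made
> up):
> 
>     #include "Python.h"
> 
>     /* A zero-copy, string-like view onto part of a base object, e.g.
>      * an mmap'd module file.  The buffer object keeps a reference to
>      * the base and shares its memory rather than copying it. */
>     static PyObject *
>     code_view(PyObject *base, int offset, int size)
>     {
>         return PyBuffer_FromObject(base, offset, size);
>     }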
> 
> I don't think that any of these benefits have been realized yet, and
> altogether I think that the buffer object causes a lot of confusion.
> The buffer *API* doesn't guarantee enough about the lifetime of the
> pointers for the buffer *object* to be able to safely preserve those
> pointers, even if the buffer object holds on to the base object.  (The
> C-level buffer API informally guarantees that the data remains valid
> only until you do anything to the base object; this is usually fine as
> long as you don't release the global interpreter lock.)
> 
> The buffer object's approach to implementing the various sequence
> operations is strange: sometimes it behaves like a string, sometimes
> it doesn't.  E.g. a slice returns a new string object unless it
> happens to address the whole buffer, in which case it returns a
> reference to the existing buffer object.  It would seem more logical
> that a subslice would return a new buffer object.  Concatenation and
> repetition of buffer objects are likewise implemented inconsistently;
> it would have been more consistent with the intended purpose if these
> weren't supported at all (i.e. if none of the buffer object operations
> would allocate new memory except for buffer object headers).
> 
> I would have concluded that the buffer object is entirely useless, if
> it weren't for some very light use that is being made of it by the
> Unicode machinery.  I can't quite tell whether that was done just
> because it was convenient, or whether that shows there is a real
> need.
> 
> What Now?
> ---------
> 
> I'm not convinced that we need the buffer object at all.  For example,
> the mmap module defines a sequence object, so it doesn't seem to need the
> buffer object to help it support slices.
> 
> Regarding the buffer API, it's clearly useful, although I'm not
> convinced that it needs the multiple segment count option or the char
> vs. binary buffer distinction, given that we're not using this for
> Unicode objects as we originally planned.
> 
> I also feel that it would be helpful if there was an explicit way to
> lock and unlock the data, so that a file object can release the global
> interpreter lock while it is doing the I/O.  But that's not a high
> priority (and there are no *actual* problems caused by the lack of
> such an API -- just *theoretical*).
> 
> For Python 3000, I think I'd like to rethink this whole mess.  Perhaps
> byte buffers and character strings should be different beasts, and
> maybe character strings could have Unicode and 8-bit subclasses (and
> maybe other subclasses that explicitly know about their encoding).
> And maybe we'd have a real file base class.  And so on.
> 
> What to do in the short run?  I'm still for severely simplifying the
> buffer object (ripping out the unused operations) and deprecating it.
> 
> --Guido van Rossum (home page: http://www.python.org/~guido/)
> 

-- 
Jeffery D. Collins  
Sr. Software Developer
Endeavors Technology, Inc.