[Python-Dev] pickling of large arrays

Scott Gilbert xscottg@yahoo.com
Thu, 20 Feb 2003 05:58:18 -0800 (PST)


--- "Ralf W. Grosse-Kunstleve" <rwgk@yahoo.com> wrote:
> This question is related to PEP 307, "Extensions to the pickle
> protocol", http://www.python.org/peps/pep-0307.html .
> 
> Apparently the new Pickle "protocol 2" provides a mechanism for
> avoiding large temporaries, but only for lists and dicts (section
> "Pickling of large lists and dicts" near the end). I am wondering if
> the new protocol could also help us to eliminate large temporaries when
> pickling Boost.Python extension classes.
> 
> We wrote an open source C++ array library with Boost.Python bindings.
> For pickling we use the __getstate__, __setstate__ protocol. As it
> stands, pickling involves converting the arrays to Python strings,
> similar to what is done in Numpy. There are two mechanisms:
> 
> 1. "single buffered":
> 
>    For numeric types (int, long, double, etc.) a Python string is
>    allocated based on an upper estimate for the required size
>    (PyString_FromStringAndSize). The entire numeric array is converted
>    directly to that string. Finally the Python string is resized
>    (_PyString_Resize).
>    With this mechanism there are 2 copies of the array in memory:
>      - the original array and
>      - the Python string.
> 
> 2. "double buffered":
> 
>    For some user-defined element types it is very difficult to estimate
>    an upper limit for the size of the string representation. Therefore
>    the array is first converted to a dynamically growing C++
>    std::string, which is then copied to a Python string.
>    With this mechanism there are 3 copies of the array in memory:
>      - the original array,
>      - the std::string, and
>      - the Python string.
> 
> For very large arrays the memory overhead can be a limiting factor.
> Could the new protocol 2 help us in some way?
> 


I hadn't seen this PEP yet.  Very nice.

I don't know how to solve your second case with it, but it looks like you
could solve your first case with very little overhead:

Have your __reduce__ method return a 4-tuple (function, arguments, state,
listitems) with:

  function = a constructing function that takes the length of your array
in bytes and the type of its elements

  arguments = a 2-tuple giving that length and type

  state = None

  listitems = an iterator that yields your array's contents in small
chunks, say 1K or 8K strings at a time (see the sketch below)
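
Concretely, here is a minimal sketch (in present-day Python) of what
that could look like.  ChunkedArray, make_array, typecode and CHUNK are
all invented names for illustration, not part of any real library:

    CHUNK = 8 * 1024   # 8K chunk size, one of the sizes suggested above

    def make_array(nbytes, typecode):
        # Module-level constructor so the pickler can find it by name.
        # It allocates the full array once; extend() fills it in chunks.
        return ChunkedArray(nbytes, typecode)

    class ChunkedArray:
        def __init__(self, nbytes, typecode):
            self.typecode = typecode
            self.buf = bytearray(nbytes)   # one allocation, up front
            self._fill = 0                 # bytes written by extend() so far

        def append(self, chunk):
            # Protocols 0 and 1 hand items back one at a time.
            self.extend([chunk])

        def extend(self, chunks):
            # Protocol 2 batches listitems and passes them as a list.
            for chunk in chunks:
                n = len(chunk)
                self.buf[self._fill:self._fill + n] = chunk
                self._fill += n

        def _chunks(self):
            # Generator yielding CHUNK-sized slices of the raw data, so
            # pickling never builds a full-size second copy in one piece.
            for i in range(0, len(self.buf), CHUNK):
                yield bytes(self.buf[i:i + CHUNK])

        def __reduce__(self):
            return (make_array,                      # function
                    (len(self.buf), self.typecode),  # arguments
                    None,                            # state
                    self._chunks())                  # listitems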

This strategy should avoid having to "double buffer" your array when
pickling.  The overhead would be the single 1K or 8K chunk, reused many
times over while pickling a large array.  Your multi-megabyte original
array would never have to be copied all at once.

For unpickling, you'd have your constructor function allocate all the
space in one go, and give your append() and extend() methods the right
semantics: the unpickler feeds the chunks back in through them.
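
A round trip, under the same assumptions as the sketch above (the size
and the filler bytes are arbitrary demo values):

    import pickle

    a = ChunkedArray(1 << 20, "d")     # 1 MB of storage for the demo
    a.buf[:4] = b"\xde\xad\xbe\xef"    # make the contents recognizable

    data = pickle.dumps(a, 2)          # protocol 2, per PEP 307
    b = pickle.loads(data)             # make_array + batched extend() calls

    assert b.buf == a.buf and b.typecode == a.typecode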


I've been gone for a while; is this PEP going to be included in the final
version of 2.3?
