dynamic allocation file buffer

Larry Bates larry.bates at vitalEsafe.com
Tue Sep 9 18:44:23 EDT 2008


castironpi wrote:
> I will try my idea again.  I want to talk to people about a module I
> want to write and I will take the time to explain it.  I think it's a
> "cool idea" that a lot of people, forgiving the slang, could benefit
> from.  What are its flaws?
> 
> A user has a file he is using either 1/ to persist binary data after
> the run of a single program (persistence) or 2/ share binary data
> between concurrently running programs (IPC / shared memory).  The data
> are records of variable types and lengths that can change over time.
> He wants to change a record that's already present in the file.  Here
> are two examples.
> 
> Use Case 1: Hierarchical ElementTree-style data
> 
> A user has an XML file like the one shown here.
> 
> <a>
>   <b>
>     <c>Foo</c>
>   </b>
>   ...
> 
> He wants to change "Foo" to "Foobar".
> 
> <a>
>   <b>
>     <c>Foobar</c>
>   </b>
>   ...
> 
> The change he wants to make is at the beginning of a 4GB file, and
> recopying the remainder is an unacceptable resource drain.
> 
> Use Case 2: Web session logger
> 
> A tutor application has written a plugin to a webbrowser that records
> the order of a user's mouse and keyboard activity during a browsing
> session, and makes them concurrently available to other applications
> in a suite, which are written in varying lanugages.  The user takes
> some action, such as surfing to a site or clicking on a link.  The
> browser plugin records that sequence into shared memory, where it is
> marked as acknowledged by the listener programs, and recycled back
> into an unused block.  URLs, user inputs, and link text can be of any
> length, so truncating them to fit a fixed length is not an option.
> 
> Existing Solutions
> 
> - Shelve - A Python Standard Library shelf object can store a random
> access dictionary mapping strings to pickled objects.  It does not
> provide for hierarchical data stores, and objects must be unpickled
> before they can be examined.
> - Relational Database - Separate tables of nodes, attributes, and
> text, and the relations between them are slow and unwieldy to
> reproduce the contents of a dynamic structure.  The VARCHAR data type
> still carries a maximum size, no more flexible than fixed-length
> records.
> - POSH - Python Object Sharing - A module currently in its alpha stage
> promises to make it possible to store Python objects directly in
> shared memory.  In its current form, its only entry point is 'fork'
> and does not offer persistence, only sharing.  See:
>     http://poshmodule.sourceforge.net/
> 
> Dynamic Allocation
> 
> The traditional solution, dynamic memory allocation, is to maintain a
> metadata list of "free blocks" that are available to write to.  See:
>     http://en.wikipedia.org/wiki/Dynamic_memory_allocation
>     http://en.wikipedia.org/wiki/Malloc
>     http://en.wikipedia.org/wiki/Mmap
>     http://en.wikipedia.org/wiki/Memory_leak
> The catch, and the crux of the proposal, is that the metadata must be
> stored in shared memory along with the data themselves.  Assuming they
> are, a program can acquire the offset of an unused block of a
> sufficient size for its data, then write it to the file at that
> offset.  The metadata can maintain the offset of one root member, to
> serve as a 'table of contents' or header for the remainder of the
> file.  It can be grown and reassigned as needed.
> 
> An acquaintence writes: It could be quite useful for highly concurrent
> systems: the overhead involved with interprocess communication can be
> overwhelming, and something more flexible than normal object
> persistence to disk might be worth having.
> 
> Python Applicability
> 
> The usual problems with data persistence and sharing apply.  The
> format of the external data is only established conventionally, and
> conversions between Python objects and raw memory bytes take the usual
> overhead.  'struct.Struct', 'ctypes.Structure', and 'pickle.Pickler'
> currently offer this functionality, and the buffer offset obtained
> from 'alloc' can be used with all three.
> 
> Ex 1.
>     s= struct.Struct( 'III' )
>     x= alloc( s.size )
>     s.pack_into( mem, x, 2, 4, 6 )
> Struct in its current form does not permit random access into
> structure contents; a user must read or write the entire converted
> strucutre in order to update one field.  Alternative:
>     s= struct.Struct( 'I' )
>     x1, x2, x3= alloc( s.size ), alloc( s.size ), alloc( s.size )
>     s.pack_into( mem, x1, 2 )
>     s.pack_into( mem, x2, 4 )
>     s.pack_into( mem, x3, 6 )
> 
> Ex 2.
>     class Items( ctypes.Structure ):
>         _fields_= [
>             ( 'x1', ctypes.c_float ),
>             ( 'y1', ctypes.c_float ) ]
>     x= alloc( ctypes.sizeof( Items ) )
>     c= ctypes.cast( mem+ x, ctypes.POINTER( Items ) ).contents
>     c.x1, c.y1= 2, 4
> The 'mem' variable is obtained from a call to PyObject_AsWriteBuffer.
> 
> Ex 3.
>     s= pickle.dumps( ( 2, 4, 6 ) )
>     x= alloc( len( s ) )
>     mem[ x: x+ len( s ) ]= s
> 'dumps' is still slow and nor does permit random access into contents.
> 
> Use Cases Revisited
> 
> Use Case 1: Hierarchical ElementTree-style data
> Solution: Dynamically allocate the tree and its elements.
> 
> Node: tag: a
> Node: tag: b
> Node: tag: c
> Node: text: Foo
> 
> The user wants to change "Foo" to "Foobar".
> 
> Node: tag: a
> Node: tag: b
> Node: tag: c
> Node: text: Foobar
> 
> Deallocate 'Node: text: Foo', allocate 'Node: text: Foobar', and store
> the new offset into 'Node: tag: c'.  Total writes 6 bytes 'foobar', a
> one-word offset, and approximatly 5- 10-word metadata update.
> 
> Use Case 2: Web session logger
> Dynamically allocate a linked list of data points.
> 
> Data: 'friendster.com'
> Data: 'My Account'
> 
> Allocate one block for each string, adding it to a linked list.  As
> listeners acknowledge each data point, remove it from the linked
> list.  Keep the head node in the 'root offset' metadata field.
> 
> Restrictions
> 
> It is not possible for persistent memory to refer to live memory.  Any
> objects it refers to must also be located in file.  Their mapped
> addresses must not be stored, only their offsets into it.  However,
> live references to persistent memory are eminently possible.
> 
> Current Status
> 
> A pure Python alloc-free implementation based on the GNU PAVL tree
> library is on Google Code.  It is only in proof-of-concept form and
> not commented, but does contain a first-pass test suite.  See:
>     http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk
> The ctypes solution for access is advised.

You should review Zope's ZODB and/or memcached before putting in too much effort.

-Larry



More information about the Python-list mailing list