dynamic allocation file buffer

castironpi castironpi at gmail.com
Tue Sep 9 17:59:19 EDT 2008


I will try my idea again.  I want to talk to people about a module I
want to write and I will take the time to explain it.  I think it's a
"cool idea" that a lot of people, forgiving the slang, could benefit
from.  What are its flaws?

A user has a file he is using either 1/ to persist binary data after
the run of a single program (persistence) or 2/ share binary data
between concurrently running programs (IPC / shared memory).  The data
are records of variable types and lengths that can change over time.
He wants to change a record that's already present in the file.  Here
are two examples.

Use Case 1: Hierarchical ElementTree-style data

A user has an XML file like the one shown here.

<a>
  <b>
    <c>Foo</c>
  </b>
  ...

He wants to change "Foo" to "Foobar".

<a>
  <b>
    <c>Foobar</c>
  </b>
  ...

The change he wants to make is at the beginning of a 4GB file, and
recopying the remainder is an unacceptable resource drain.

Use Case 2: Web session logger

A tutor application has written a plugin to a webbrowser that records
the order of a user's mouse and keyboard activity during a browsing
session, and makes them concurrently available to other applications
in a suite, which are written in varying lanugages.  The user takes
some action, such as surfing to a site or clicking on a link.  The
browser plugin records that sequence into shared memory, where it is
marked as acknowledged by the listener programs, and recycled back
into an unused block.  URLs, user inputs, and link text can be of any
length, so truncating them to fit a fixed length is not an option.

Existing Solutions

- Shelve - A Python Standard Library shelf object can store a random
access dictionary mapping strings to pickled objects.  It does not
provide for hierarchical data stores, and objects must be unpickled
before they can be examined.
- Relational Database - Separate tables of nodes, attributes, and
text, and the relations between them are slow and unwieldy to
reproduce the contents of a dynamic structure.  The VARCHAR data type
still carries a maximum size, no more flexible than fixed-length
records.
- POSH - Python Object Sharing - A module currently in its alpha stage
promises to make it possible to store Python objects directly in
shared memory.  In its current form, its only entry point is 'fork'
and does not offer persistence, only sharing.  See:
    http://poshmodule.sourceforge.net/

Dynamic Allocation

The traditional solution, dynamic memory allocation, is to maintain a
metadata list of "free blocks" that are available to write to.  See:
    http://en.wikipedia.org/wiki/Dynamic_memory_allocation
    http://en.wikipedia.org/wiki/Malloc
    http://en.wikipedia.org/wiki/Mmap
    http://en.wikipedia.org/wiki/Memory_leak
The catch, and the crux of the proposal, is that the metadata must be
stored in shared memory along with the data themselves.  Assuming they
are, a program can acquire the offset of an unused block of a
sufficient size for its data, then write it to the file at that
offset.  The metadata can maintain the offset of one root member, to
serve as a 'table of contents' or header for the remainder of the
file.  It can be grown and reassigned as needed.

An acquaintence writes: It could be quite useful for highly concurrent
systems: the overhead involved with interprocess communication can be
overwhelming, and something more flexible than normal object
persistence to disk might be worth having.

Python Applicability

The usual problems with data persistence and sharing apply.  The
format of the external data is only established conventionally, and
conversions between Python objects and raw memory bytes take the usual
overhead.  'struct.Struct', 'ctypes.Structure', and 'pickle.Pickler'
currently offer this functionality, and the buffer offset obtained
from 'alloc' can be used with all three.

Ex 1.
    s= struct.Struct( 'III' )
    x= alloc( s.size )
    s.pack_into( mem, x, 2, 4, 6 )
Struct in its current form does not permit random access into
structure contents; a user must read or write the entire converted
strucutre in order to update one field.  Alternative:
    s= struct.Struct( 'I' )
    x1, x2, x3= alloc( s.size ), alloc( s.size ), alloc( s.size )
    s.pack_into( mem, x1, 2 )
    s.pack_into( mem, x2, 4 )
    s.pack_into( mem, x3, 6 )

Ex 2.
    class Items( ctypes.Structure ):
        _fields_= [
            ( 'x1', ctypes.c_float ),
            ( 'y1', ctypes.c_float ) ]
    x= alloc( ctypes.sizeof( Items ) )
    c= ctypes.cast( mem+ x, ctypes.POINTER( Items ) ).contents
    c.x1, c.y1= 2, 4
The 'mem' variable is obtained from a call to PyObject_AsWriteBuffer.

Ex 3.
    s= pickle.dumps( ( 2, 4, 6 ) )
    x= alloc( len( s ) )
    mem[ x: x+ len( s ) ]= s
'dumps' is still slow and nor does permit random access into contents.

Use Cases Revisited

Use Case 1: Hierarchical ElementTree-style data
Solution: Dynamically allocate the tree and its elements.

Node: tag: a
Node: tag: b
Node: tag: c
Node: text: Foo

The user wants to change "Foo" to "Foobar".

Node: tag: a
Node: tag: b
Node: tag: c
Node: text: Foobar

Deallocate 'Node: text: Foo', allocate 'Node: text: Foobar', and store
the new offset into 'Node: tag: c'.  Total writes 6 bytes 'foobar', a
one-word offset, and approximatly 5- 10-word metadata update.

Use Case 2: Web session logger
Dynamically allocate a linked list of data points.

Data: 'friendster.com'
Data: 'My Account'

Allocate one block for each string, adding it to a linked list.  As
listeners acknowledge each data point, remove it from the linked
list.  Keep the head node in the 'root offset' metadata field.

Restrictions

It is not possible for persistent memory to refer to live memory.  Any
objects it refers to must also be located in file.  Their mapped
addresses must not be stored, only their offsets into it.  However,
live references to persistent memory are eminently possible.

Current Status

A pure Python alloc-free implementation based on the GNU PAVL tree
library is on Google Code.  It is only in proof-of-concept form and
not commented, but does contain a first-pass test suite.  See:
    http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk
The ctypes solution for access is advised.



More information about the Python-list mailing list