dynamic allocation file buffer

Fri Sep 12 08:39:11 EDT 2008

On Sep 12, 4:34 am, Paul Boddie <p... at boddie.org.uk> wrote:
> On 12 Sep, 08:30, Steven D'Aprano
>
> <ste... at REMOVE.THIS.cybersource.com.au> wrote:
>
> > Which is why I previously said that XML was not well suited for random
> > access.
>
> Maybe not.

No, it's not.  Element trees are, which is what I should have said
originally...

> A consideration of other storage formats such as HDF5 might
> be appropriate:
>
> http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html
>
> There are, of course, HDF5 tools available for Python.

PyTables came up within the past few weeks on the list.

"When the file is created, the metadata in the object tree is updated
in memory while the actual data is saved to disk. When you close the
file the object tree is no longer available. However, when you reopen
this file the object tree will be reconstructed in memory from the
metadata on disk...."

This is different from what I had in mind, but how different depends
on how slow the 'reconstructed in memory' step is.  (The quote is from
http://www.pytables.org/docs/manual/ch01.html#id2506782 .)  The
counterexample would be needing random access into multiple data
files which don't all fit in memory at once, but the maturity of the
package might outweigh that.  Reconstruction will form a bottleneck
anyway.

> > I think we're starting to be sucked into a vortex of obtuse and opaque
> > communication.
>
> I don't know about that. I'm managing to keep up with the discussion.
>
> > We agree that XML can store hierarchical data, and that it
> > has to be read and written sequentially, and that whatever the merits of
> > castironpi's software, his original use-case of random access to a 4GB
> > XML file isn't workable. Yes?

I could renege on that bid and talk about a 4MB file, where recopying
is prohibitively expensive and so random access is needed, thereby
requiring an alternative to XML.
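
To make the recopying point concrete, here is a minimal sketch of an
in-place update through a memory map, using only the standard library.
The record layout (RECORD) and the function names are made up for
illustration; they are not taken from pymmapstruct.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: 4-byte int id + 12-byte name (16 bytes).
RECORD = struct.Struct("<i12s")


def write_records(path, records):
    """Write every record sequentially (the 'recopy everything' path)."""
    with open(path, "wb") as f:
        for rec_id, name in records:
            f.write(RECORD.pack(rec_id, name))


def update_record(path, index, rec_id, name):
    """Overwrite one record in place; no other bytes are touched."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            RECORD.pack_into(m, index * RECORD.size, rec_id, name)


def read_record(path, index):
    """Read one record by seeking straight to its offset."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            rec_id, name = RECORD.unpack_from(m, index * RECORD.size)
            return rec_id, name.rstrip(b"\0")


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_records(path, [(i, b"item%d" % i) for i in range(1000)])
        update_record(path, 500, 500, b"changed")
        return read_record(path, 499), read_record(path, 500)
    finally:
        os.remove(path)
```

This only works because every record has the same size.  Variable-size
nodes, as in a tree, are exactly where some kind of alloc/free layer
becomes necessary.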

> Again, XML specifically might not be workable for random access in a
> serialised form, despite people's best efforts at processing it in
> various unconventional ways, but that doesn't mean that random access
> to a 4GB file containing hierarchical data isn't possible, so I
> suppose it depends on whether he is wedded to the idea of using
> vanilla XML or not.

No.  It is always nice to be able to scroll through your data, but
it's much less common to be able to scroll through a data -structure-.
(Which is part of the reason data structures are hard to design.)

> It's always worth exploring the available
> alternatives before embarking on a challenging project, unless one
> wants to pursue the exercise as a learning experience, and I therefore
> suggest investigating whether HDF5 doesn't already solve at least some
> of the problems or use-cases stated in this discussion.

The potential for concurrency is definitely one benefit of raw alloc/
free management, and a requirement I was setting out to program
directly for.  There is a multi-threaded version of HDF5 but
interprocess communication is unsupported.

"This version serializes the API suitable for use in a multi-threaded
application but does not provide any level of concurrency."

From: http://www.hdfgroup.uiuc.edu/papers/features/mthdf/

(It is always appreciated to find a statement of what a product does
not do.)

> Paul

There is an updated statement of the problem on the project website:

http://code.google.com/p/pymmapstruct/source/browse/trunk/pymmapstruct.txt

I don't have numbers for my claim that the abstraction layers in SQL,
including string construction and parsing, are ever a bottleneck or
limiting factor, even though it seems intuitive.  Until I get
those, maybe I should leave that allegation out.
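
One way to get such numbers would be a small timing harness along
these lines.  This is only a sketch under my own assumptions: the
table, the record layout (REC) and the function names are invented
for the example, and sqlite3 in memory stands in for "SQL" in general.
It reports timings and says nothing by itself about which side wins.

```python
import sqlite3
import struct
import timeit

N = 1000
REC = struct.Struct("<id")  # hypothetical record: int key, float value


def build():
    """Create the same N records as an SQL table and as packed bytes."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v REAL)")
    db.executemany("INSERT INTO t VALUES (?, ?)",
                   [(i, i * 0.5) for i in range(N)])
    raw = b"".join(REC.pack(i, i * 0.5) for i in range(N))
    return db, raw


def sql_lookup(db, k):
    # Build the SQL text on every call, so string construction and
    # parsing are part of what gets timed.  Acceptable here only
    # because k is an int generated by this script.
    return db.execute("SELECT v FROM t WHERE k = %d" % k).fetchone()[0]


def raw_lookup(raw, k):
    # Jump straight to the record's offset and unpack it.
    return REC.unpack_from(raw, k * REC.size)[1]


def measure(rounds=5):
    """Return (sql_seconds, raw_seconds) for `rounds` passes over all keys."""
    db, raw = build()
    t_sql = timeit.timeit(
        lambda: [sql_lookup(db, k) for k in range(N)], number=rounds)
    t_raw = timeit.timeit(
        lambda: [raw_lookup(raw, k) for k in range(N)], number=rounds)
    db.close()
    return t_sql, t_raw
```

Calling measure() returns the two elapsed times in seconds.  An
in-memory table with parameter-free queries is a narrow case, so any
result would need repeating against a real database and workload.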

Compared to the complexity of all these other packages (Zope,
memcached, HDF5/PyTables), alloc and free are almost looking like they
should become methods on a subclass of the builtin buffer type.  Ha!
(Ducks.)  They're beyond dangerous compared to the snuggly feeling of
Python though, so maybe they could belong in ctypes.
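
For the sake of argument, alloc and free on a buffer subclass might
look like the toy below.  It is a first-fit free list over a
bytearray, written just for this post: HeapBuffer and its attributes
are hypothetical names, and it is not the pymmapstruct code and not
anything proposed for the standard library.

```python
class HeapBuffer(bytearray):
    """Hypothetical bytearray subclass with first-fit alloc/free."""

    def __init__(self, size):
        super().__init__(size)      # zero-filled buffer of `size` bytes
        self._free = [(0, size)]    # sorted list of (offset, length) holes
        self._used = {}             # offset -> length of live blocks

    def alloc(self, length):
        """Return the offset of a free block of `length` bytes."""
        if length <= 0:
            raise ValueError("length must be positive")
        for i, (off, size) in enumerate(self._free):
            if size >= length:
                if size == length:
                    del self._free[i]
                else:
                    # Carve the block off the front of the hole.
                    self._free[i] = (off + length, size - length)
                self._used[off] = length
                return off
        raise MemoryError("no free block of %d bytes" % length)

    def free(self, offset):
        """Release a block and merge it with adjacent holes."""
        # KeyError here means a double free or a bad offset.
        length = self._used.pop(offset)
        self._free.append((offset, length))
        self._free.sort()
        merged = [self._free[0]]
        for off, size in self._free[1:]:
            last_off, last_size = merged[-1]
            if last_off + last_size == off:
                merged[-1] = (last_off, last_size + size)
            else:
                merged.append((off, size))
        self._free = merged


buf = HeapBuffer(64)
a = buf.alloc(16)
b = buf.alloc(16)
buf[a:a + 5] = b"hello"   # same-length slice assignment keeps the size
buf.free(a)
c = buf.alloc(8)          # first fit reuses the hole left by `a`
```

The dangerous part shows up immediately: nothing stops a caller from
writing past its block, or from resizing the bytearray out from under
the free list.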

Aaron