Fastest way to store ints and floats on disk

John Machin sjmachin at lexicon.net
Sat Aug 9 17:43:36 EDT 2008


On Aug 10, 4:58 am, castironpi <castiro... at gmail.com> wrote:
> On Aug 7, 2:27 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
>
>
>
> > On 2008-08-07 20:41, Laszlo Nagy wrote:
>
> > >  Hi,
>
> > > I'm working on a pivot table. I would like to write it in Python. I
> > > know I should be doing it in C, but I would like to create a
> > > cross-platform version which can deal with smaller databases (not more
> > > than a million facts).
>
> > > The data is first imported from a CSV file: the user selects which
> > > columns contain dimension and measure data (and which columns to
> > > ignore). In the next step I would like to build up a database that is
> > > efficient enough to be used for making pivot tables. Here is my idea for
> > > the database:
>
> > > Original CSV file with column header and values:
>
> > > "Color","Year","Make","Price","VMax"
> > > Yellow,2000,Ferrari,100000,254
> > > Blue,2003,Volvo,50000,210
>
> > > Using the GUI, it is converted to this:
>
> > > dimensions = [
> > >    { 'name':'Color', 'colindex':0, 'values':[ 'Red', 'Blue', 'Green',
> > > 'Yellow' ], },
> > >    { 'name':'Year', 'colindex':1, 'values':[
> > > 1995,1999,2000,2001,2002,2003,2007 ], },
> > >    { 'name':'Make', 'colindex':2, 'values':[ 'Ferrari', 'Volvo', 'Ford',
> > > 'Lamborghini' ], },
> > > ]
> > > measures = [
> > >    { 'name':'Price', 'colindex':3 },
> > >    { 'name':'Vmax', 'colindex':4 },
> > > ]
> > > facts = [
> > >    ( (3,2,0), (100000.0,254.0) ),  # ( dimension_value_indexes,
> > >                                    #   measure_values )
> > >    ( (1,5,1), (50000.0,210.0) ),
> > >    # ... a million rows or fewer
> > > ]
>
> > > The core of the idea is that, when using a relatively small number of
> > > possible values for each dimension, the facts table becomes
> > > significantly smaller and easier to process. (Processing the facts would
> > > be: iterate over facts, filter out some of them, create statistical
> > > values of the measures, grouped by dimensions.)
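
That processing loop is simple once the facts are in the index-tuple
form shown above. A minimal sketch, assuming facts is an iterable of
(dimension_value_indexes, measure_values) pairs as in the layout above;
keep and group_by are parameter names I'm making up for illustration:

    from collections import defaultdict

    def aggregate(facts, keep, group_by=(0,)):
        # keep(dims) decides which facts survive the filter;
        # group_by lists the dimension positions to group on.
        sums = {}
        counts = defaultdict(int)
        for dims, measures in facts:
            if not keep(dims):
                continue
            key = tuple(dims[i] for i in group_by)
            if key not in sums:
                sums[key] = [0.0] * len(measures)
            for i, m in enumerate(measures):
                sums[key][i] += m
            counts[key] += 1
        return sums, counts

e.g. aggregate(facts, keep=lambda dims: dims[0] == 3, group_by=(1,))
would sum Price and VMax per Year for Yellow cars only.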
>
> > > The facts table cannot be kept in memory because it is too big. I need
> > > to store it on disk, read it back incrementally, and compute statistics.
> > > In most cases, the "statistic" will be a simple sum of the measures and
> > > a count of the facts affected. To be efficient, reading the facts from
> > > disk should not involve complex conversions. For this reason, storing
> > > in CSV, XML or any other textual format would be bad. I'm thinking
> > > about a binary format, but how can I interface that with Python?
>
> > > I already looked at:
>
> > > - xdrlib, which raises a DeprecationWarning when I store some integers
> > > - struct, which needs a format string for each read operation; I'm
> > > concerned about its speed
>
> > > What else can I use?
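
On the struct speed concern: a minimal sketch, assuming one fact is a
fixed-length record of three dimension indexes plus two measures (the
layout from the example above). Precompiling the format with
struct.Struct parses the format string once, so the per-record
pack/unpack calls skip that overhead:

    import struct

    FACT = struct.Struct('<IIIdd')  # 3 unsigned ints + 2 doubles (assumed)

    def write_facts(f, facts):
        for dims, measures in facts:
            f.write(FACT.pack(dims[0], dims[1], dims[2],
                              measures[0], measures[1]))

    def read_facts(f):
        # Read incrementally, one fixed-size record at a time.
        while True:
            buf = f.read(FACT.size)
            if len(buf) < FACT.size:
                break
            rec = FACT.unpack(buf)
            yield rec[:3], rec[3:]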
>
> >  >>> import marshal
> >  >>> marshal.dump(1, open('test.db', 'wb'))
> >  >>> marshal.load(open('test.db', 'rb'))
> > 1
>
> > It's also very fast at dumping/loading lists, tuples, dictionaries,
> > floats, etc.
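
For the facts layout above, marshal handles the nested tuples directly;
a quick sketch (note that marshal's file format is Python-version
dependent, so it suits scratch files better than long-lived databases):

 >>> facts = [((3,2,0), (100000.0,254.0)), ((1,5,1), (50000.0,210.0))]
 >>> marshal.dump(facts, open('facts.db', 'wb'))
 >>> marshal.load(open('facts.db', 'rb'))
[((3, 2, 0), (100000.0, 254.0)), ((1, 5, 1), (50000.0, 210.0))]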
>
> Depending on how hard-core you want to be, store the int, float,
> string, and long C structures directly to disk, at a given offset.
> Either use fixed-length strings, or implement (or find) a memory
> manager.  Anyone have a good alloc-realloc-free library, C or Python?

A long time ago, when I last needed to bother about such things (to
override the memory allocator in the DJGPP RTL), Doug Lea's malloc did
the trick.

A memory allocator written in Python? That's a novel concept.
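
More seriously: if each fact is a fixed-length record (as in the struct
sketch above), no allocator is needed at all; plain offset arithmetic
gives random access. A minimal sketch, same assumed record layout:

    import struct

    FACT = struct.Struct('<IIIdd')  # 3 dimension indexes + 2 measures

    def read_fact_at(f, n):
        # Record n starts at byte n * FACT.size, so the file behaves
        # like a simple on-disk array: no malloc/realloc/free needed.
        f.seek(n * FACT.size)
        rec = FACT.unpack(f.read(FACT.size))
        return rec[:3], rec[3:]

Variable-length strings are what actually force a memory manager into
the picture; fixed-length fields or a separate string table avoid it.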


