Fastest way to store ints and floats on disk

castironpi castironpi at gmail.com
Sat Aug 9 14:58:57 EDT 2008


On Aug 7, 2:27 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
> On 2008-08-07 20:41, Laszlo Nagy wrote:
>
> >  Hi,
>
> > I'm working on a pivot table. I would like to write it in Python. I
> > know I should be doing that in C, but I would like to create a
> > cross-platform version which can deal with smaller databases (not
> > more than a million facts).
>
> > The data is first imported from a csv file: the user selects which
> > columns contain dimension and measure data (and which columns to
> > ignore). In the next step I would like to build up a database that is
> > efficient enough to be used for making pivot tables. Here is my idea for
> > the database:
>
> > Original CSV file with column header and values:
>
> > "Color","Year","Make","Price","VMax"
> > Yellow,2000,Ferrari,100000,254
> > Blue,2003,Volvo,50000,210
>
> > Using the GUI, it is converted to this:
>
> > dimensions = [
> >     { 'name': 'Color', 'colindex': 0,
> >       'values': [ 'Red', 'Blue', 'Green', 'Yellow' ] },
> >     { 'name': 'Year', 'colindex': 1,
> >       'values': [ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
> >     { 'name': 'Make', 'colindex': 2,
> >       'values': [ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
> > ]
> > measures = [
> >     { 'name': 'Price', 'colindex': 3 },
> >     { 'name': 'Vmax', 'colindex': 4 },
> > ]
> > facts = [
> >     # ( dimension_value_indexes, measure_values )
> >     ( (3, 2, 0), (100000.0, 254.0) ),
> >     ( (1, 5, 1), (50000.0, 210.0) ),
> >     # ... some million rows or less
> > ]
>
> > The core of the idea is that, when each dimension has a relatively
> > small number of possible values, the facts table becomes
> > significantly smaller and easier to process. (Processing the facts
> > means iterating over them, filtering some out, and computing
> > statistics on the measures, grouped by dimension.)
>
> > The facts table cannot be kept in memory because it is too big. I
> > need to store it on disk, read it incrementally, and make statistics.
> > In most cases, the "statistic" will be a simple sum of the measures
> > and a count of the facts affected. To be efficient, reading the
> > facts from disk should not involve complex conversions. For this
> > reason, storing in CSV, XML or any other textual format would be bad.
> > I'm thinking about a binary format, but how can I interface that
> > with Python?
>
> > I already looked at:
>
> > - xdrlib, which throws a DeprecationWarning when I store some integers
> > - struct, which uses a format string for each read operation; I'm
> >   concerned about its speed
>
> > What else can I use?
>
>  >>> import marshal
>  >>> marshal.dump(1, open('test.db', 'wb'))
>  >>> marshal.load(open('test.db', 'rb'))
> 1
>
> It's also very fast at dumping/loading lists, tuples, dictionaries,
> floats, etc.
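
marshal can also cover the incremental-read requirement: dump the
facts to one file in batches, then load them back a batch at a time.
A rough sketch (the batch size and helper names here are just for
illustration):

    import marshal

    BATCH = 10000  # arbitrary batch size

    def dump_facts(facts, path):
        # Append successive marshalled batches to a single file.
        out = open(path, 'wb')
        for i in range(0, len(facts), BATCH):
            marshal.dump(facts[i:i + BATCH], out)
        out.close()

    def iter_facts(path):
        # Yield facts one at a time; only one batch is in memory.
        f = open(path, 'rb')
        while True:
            try:
                batch = marshal.load(f)
            except EOFError:
                break
            for fact in batch:
                yield fact
        f.close()

The filter-and-sum pass you describe then just streams over that
iterator, e.g. summing the measures grouped on one dimension:

    def sum_by(path, dim):
        totals = {}
        counts = {}
        for dims, measures in iter_facts(path):
            # (a filter on dims could go here)
            key = dims[dim]
            if key not in totals:
                totals[key] = [0.0] * len(measures)
                counts[key] = 0
            for i, m in enumerate(measures):
                totals[key][i] += m
            counts[key] += 1
        return totals, counts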

Depending on how hard-core you want to be, you can store the int, float,
string, and long C structures directly to disk, at a given offset.
Either use fixed-length strings (see the sketch below), or implement
(or find) a memory manager.  Anyone have a good alloc-realloc-free
library, C or Python?
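
For fixed-length records, the struct module is probably fast enough
after all, if you compile the format once with struct.Struct (new in
Python 2.5) instead of re-parsing a format string on every read.  A
sketch, assuming three dimension indexes and two float measures per
fact (the field layout is only a guess from the example above):

    import struct

    # One fixed-size record per fact: three dimension indexes as
    # unsigned shorts plus two measures as doubles -> 22 bytes.
    rec = struct.Struct('<HHHdd')  # compiled once, reused per row

    def write_fact(f, rowno, dims, measures):
        # Fixed-length records: row N starts at offset N * rec.size.
        f.seek(rowno * rec.size)
        f.write(rec.pack(dims[0], dims[1], dims[2],
                         measures[0], measures[1]))

    def read_fact(f, rowno):
        f.seek(rowno * rec.size)
        return rec.unpack(f.read(rec.size))

Reading a million such records is a linear scan of about 22 MB, with
no per-row format parsing.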


