Fastest way to store ints and floats on disk

Sun Aug 10 00:41:45 EDT 2008

On Aug 9, 4:43 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Aug 10, 4:58 am, castironpi <castiro... at gmail.com> wrote:
>
>
>
> > On Aug 7, 2:27 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
>
> > > On 2008-08-07 20:41, Laszlo Nagy wrote:
>
> > > >  Hi,
>
> > > > I'm working on a pivot table. I would like to write it in Python. I
> > > > know, I should be doing that in C, but I would like to create a cross
> > > > platform version which can deal with smaller databases (not more than a
> > > > million facts).
>
> > > > The data is first imported from a csv file: the user selects which
> > > > columns contain dimension and measure data (and which columns to
> > > > ignore). In the next step I would like to build up a database that is
> > > > efficient enough to be used for making pivot tables. Here is my idea for
> > > > the database:
>
> > > > Original CSV file with column header and values:
>
> > > > "Color","Year","Make","Price","VMax"
> > > > Yellow,2000,Ferrari,100000,254
> > > > Blue,2003,Volvo,50000,210
>
> > > > Using the GUI, it is converted to this:
>
> > > > dimensions = [
> > > >    { 'name':'Color', 'colindex:0, 'values':[ 'Red', 'Blue', 'Green',
> > > > 'Yellow' ], },
> > > >    { 'name':'Year', colindex:1, 'values':[
> > > > 1995,1999,2000,2001,2002,2003,2007 ], },
> > > >    { 'name':'Make', colindex:2, 'value':[ 'Ferrari', 'Volvo', 'Ford',
> > > > 'Lamborgini' ], },
> > > > ]
> > > > measures = [
> > > >    { 'name', 'Price', 'colindex':3 },
> > > >    { 'name', 'Vmax', 'colindex':4 },
> > > > ]
> > > > facts = [
> > > >    ( (3,2,0),(100000.0,254.0)  ), # ( dimension_value_indexes,
> > > > measure_values )
> > > >    ( (1,5,1),(50000.0,210.0) ),
> > > >   .... # Some million rows or less
> > > > ]
>
> > > > The core of the idea is that, when using a relatively small number of
> > > > possible values for each dimension, the facts table becomes
> > > > significantly smaller and easier to process. (Processing the facts would
> > > > be: iterate over facts, filter out some of them, create statistical
> > > > values of the measures, grouped by dimensions.)
>
> > > > The facts table cannot be kept in memory because it is too big. I need
> > > > to store it on disk, be able to read incrementally, and make statistics.
> > > > In most cases, the "statistic" will be simple sum of the measures, and
> > > > counting the number of facts affected. To be effective, reading the
> > > > facts from disk should not involve complex conversions. For this reason,
> > > > storing in CSV or XML or any textual format would be bad. I'm thinking
> > > > about a binary format, but how can I interface that with Python?
>
> > > > I already looked at:
>
> > > > - xdrlib, which throws me DeprecationWarning when I store some integers
> > > > - struct which uses format string for each read operation, I'm concerned
> > > > about its speed
>
> > > > What else can I use?
>
> > >  >>> import marshal
> > >  >>> marshal.dump(1, open('test.db', 'wb'))
> > >  >>> marshal.load(open('test.db', 'rb'))
> > > 1
>
> > > It also very fast at dumping/loading lists, tuples, dictionaries,
> > > floats, etc.
>
> > Depending on how hard-core you want to be, store the int, float,
> > string, and long C structures directly to disk, at a given offset.
> > Either use fixed-length strings, or implement (or find) a memory
> > manager.  Anyone have a good alloc-realloc-free library, C or Python?
>
> A long time ago, when I last needed to bother about such things (to
> override the memory allocator in the DJGPP RTL), Doug Lea's malloc did
> the trick.
>
> A memory allocator written in Python? That's a novel concept.

For strings and longs, you need variable-length records.  Wrap Lea's
malloc in Python calls, and design a Python class where Year, Price,
and VMax are stored as ints at given offsets from the start of the
file, and Color and Make are stored as strings at given offsets.
Don't bother to cache, just seek and read.  Part 1 of the file, or
File 1, looks like:

Car 1 color_offset year_offset make_offset price_offset vmax_offset
Car 2 color_offset year_offset make_offset price_offset vmax_offset

Store them directly as bytes, not string reps. of numbers.

1024 1050 1054 1084 1088
1092 1112 1116 1130 1134

Part 2 looks like

1024 Yell
1028 ow
1050 2000
1054 Ferr
1058 ari
1084 100000
1088 254
1092 Blue
1112 2003
1116 Volv

and so on.