Fastest way to store ints and floats on disk

Thu Aug 7 17:08:37 EDT 2008

On Aug 7, 1:41 pm, Laszlo Nagy <gand... at shopzeus.com> wrote:
>   Hi,
>
> I'm working on a pivot table. I would like to write it in Python. I
> know, I should be doing that in C, but I would like to create a cross
> platform version which can deal with smaller databases (not more than a
> million facts).
>
> The data is first imported from a csv file: the user selects which
> columns contain dimension and measure data (and which columns to
> ignore). In the next step I would like to build up a database that is
> efficient enough to be used for making pivot tables. Here is my idea for
> the database:
>
> Original CSV file with column header and values:
>
> "Color","Year","Make","Price","VMax"
> Yellow,2000,Ferrari,100000,254
> Blue,2003,Volvo,50000,210
>
> Using the GUI, it is converted to this:
>
> dimensions = [
>     { 'name':'Color', 'colindex':0, 'values':[ 'Red', 'Blue', 'Green',
> 'Yellow' ], },
>     { 'name':'Year', 'colindex':1, 'values':[
> 1995,1999,2000,2001,2002,2003,2007 ], },
>     { 'name':'Make', 'colindex':2, 'values':[ 'Ferrari', 'Volvo', 'Ford',
> 'Lamborghini' ], },
> ]
> measures = [
>     { 'name':'Price', 'colindex':3 },
>     { 'name':'VMax', 'colindex':4 },
> ]
> facts = [
>     # ( dimension_value_indexes, measure_values )
>     ( (3,2,0),(100000.0,254.0) ),
>     ( (1,5,1),(50000.0,210.0) ),
>     # .... Some million rows or less
> ]
>
> The core of the idea is that, when using a relatively small number of
> possible values for each dimension, the facts table becomes
> significantly smaller and easier to process. (Processing the facts would
> be: iterate over facts, filter out some of them, create statistical
> values of the measures, grouped by dimensions.)
>
> The facts table cannot be kept in memory because it is too big. I need
> to store it on disk, be able to read incrementally, and make statistics.
> In most cases, the "statistic" will be a simple sum of the measures, and
> counting the number of facts affected. To be effective, reading the
> facts from disk should not involve complex conversions. For this reason,
> storing in CSV or XML or any textual format would be bad. I'm thinking
> about a binary format, but how can I interface that with Python?
>
> I already looked at:
>
> - xdrlib, which throws me DeprecationWarning when I store some integers
> - struct which uses format string for each read operation, I'm concerned
> about its speed
>
> What else can I use?
>
> Thanks,
>
>    Laszlo

Take a look at the mmap module.  It maps the file into memory, so you
get direct memory access to its contents, backed by the file system.
Perhaps struct + mmap, if you keep your strings small?
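
A minimal sketch of the struct + mmap idea, assuming every fact has three
dimension indexes and two measures as in the example above. The record
layout ("<3H2d": three unsigned 16-bit indexes, two doubles, little-endian,
no padding) and the names RECORD, write_facts and sum_measures are
assumptions made for illustration, not something from the original post.
Compiling the format once with struct.Struct avoids re-parsing the format
string on every read.

```python
import mmap
import struct

# Assumed fixed-size record: 3 dimension indexes (uint16) + 2 measures (double).
RECORD = struct.Struct("<3H2d")


def write_facts(path, facts):
    """Write each ((d0, d1, d2), (m0, m1)) fact as one fixed-size record."""
    with open(path, "wb") as f:
        for dims, measures in facts:
            f.write(RECORD.pack(*dims, *measures))


def sum_measures(path, color_index):
    """Count facts whose first dimension equals color_index and sum both measures.

    The file must be non-empty: mmap cannot map a zero-length file.
    """
    count, total_price, total_vmax = 0, 0.0, 0.0
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Step through the mapping one record at a time.
            for offset in range(0, len(mm), RECORD.size):
                color, year, make, price, vmax = RECORD.unpack_from(mm, offset)
                if color == color_index:
                    count += 1
                    total_price += price
                    total_vmax += vmax
        finally:
            mm.close()
    return count, total_price, total_vmax
```

Because every record has the same size, fact number n starts at byte
n * RECORD.size, so the file can be scanned incrementally or read from
any position without parsing what comes before it.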