Fastest way to store ints and floats on disk

M.-A. Lemburg mal at egenix.com
Thu Aug 7 15:27:01 EDT 2008


On 2008-08-07 20:41, Laszlo Nagy wrote:
> 
>  Hi,
> 
> I'm working on a pivot table and would like to write it in Python. I
> know I should be doing that in C, but I would like to create a
> cross-platform version which can deal with smaller databases (not more
> than a million facts).
> 
> The data is first imported from a CSV file: the user selects which
> columns contain dimension and measure data (and which columns to
> ignore). In the next step, I would like to build up a database that is
> efficient enough to be used for making pivot tables. Here is my idea
> for the database:
> 
> Original CSV file with column header and values:
> 
> "Color","Year","Make","Price","VMax"
> Yellow,2000,Ferrari,100000,254
> Blue,2003,Volvo,50000,210
> 
> Using the GUI, it is converted to this:
> 
> dimensions = [
>     { 'name': 'Color', 'colindex': 0,
>       'values': [ 'Red', 'Blue', 'Green', 'Yellow' ] },
>     { 'name': 'Year', 'colindex': 1,
>       'values': [ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
>     { 'name': 'Make', 'colindex': 2,
>       'values': [ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
> ]
> measures = [
>     { 'name': 'Price', 'colindex': 3 },
>     { 'name': 'Vmax', 'colindex': 4 },
> ]
> facts = [
>     ( (3,2,0), (100000.0,254.0) ),  # (dimension_value_indexes, measure_values)
>     ( (1,5,1), (50000.0,210.0) ),
>     ....  # some million rows or less
> ]
> 
> 
> The core of the idea is that, with a relatively small number of
> possible values for each dimension, the facts table becomes
> significantly smaller and easier to process. (Processing the facts
> means: iterate over them, filter some out, and compute statistics on
> the measures, grouped by dimensions.)
> 
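A minimal sketch of that aggregation pass (the helper name and its
signature are mine, just to make the idea concrete):

    from collections import defaultdict

    def aggregate(facts, dim_index, n_measures, keep=None):
        # One pass over the facts: optional filter, then per-group
        # running sums of each measure plus a fact count.
        sums = defaultdict(lambda: [0.0] * n_measures)
        counts = defaultdict(int)
        for dim_values, measure_values in facts:
            if keep is not None and not keep(dim_values):
                continue
            group = dim_values[dim_index]
            for i, value in enumerate(measure_values):
                sums[group][i] += value
            counts[group] += 1
        return sums, counts

With your two example facts, aggregate(facts, 0, 2) groups by Color
and returns sums {3: [100000.0, 254.0], 1: [50000.0, 210.0]} and
counts {3: 1, 1: 1}.
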
> The facts table cannot be kept in memory because it is too big. I need
> to store it on disk, read it back incrementally, and compute the
> statistics. In most cases, the "statistic" will be a simple sum of the
> measures plus a count of the facts affected. To be efficient, reading
> the facts from disk should not involve complex conversions; for this
> reason, storing them in CSV, XML or any other textual format would be
> bad. I'm thinking about a binary format, but how can I interface that
> with Python?
> 
> I already looked at:
> 
> - xdrlib, which throws a DeprecationWarning when I store some integers
> - struct, which parses a format string on each read operation; I'm
> concerned about its speed
> 
> What else can I use?
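
On the struct point first: since Python 2.5 the format string can be
precompiled once with struct.Struct, so the per-read parsing you are
worried about happens only once (the '<3i2d' layout below is just an
assumption matching your example facts):

 >>> import struct
 >>> fact_struct = struct.Struct('<3i2d')  # 3 dim indexes, 2 measures
 >>> packed = fact_struct.pack(3, 2, 0, 100000.0, 254.0)
 >>> fact_struct.unpack(packed)
(3, 2, 0, 100000.0, 254.0)

That said, the simplest answer to your question: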

 >>> import marshal
 >>> marshal.dump(1, open('test.db', 'wb'))
 >>> marshal.load(open('test.db', 'rb'))
1

It is also very fast at dumping/loading lists, tuples, dictionaries,
floats, etc.
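
For the fact table itself, a minimal sketch along those lines
(untested; the file layout and helper names are my own invention):

    import marshal

    def write_facts(filename, facts):
        # One marshal record per fact, appended sequentially.
        f = open(filename, 'wb')
        try:
            for fact in facts:
                marshal.dump(fact, f)
        finally:
            f.close()

    def read_facts(filename):
        # Generator: yields one fact at a time, so the whole table
        # never has to fit in memory; marshal.load raises EOFError
        # once the stream is exhausted.
        f = open(filename, 'rb')
        try:
            while True:
                try:
                    yield marshal.load(f)
                except EOFError:
                    return
        finally:
            f.close()

If the per-record overhead matters, dumping the facts in chunks
(lists of a few thousand) amortizes the call cost at the price of
coarser streaming. One caveat: marshal's on-disk format is specific
to the Python version, so it is better suited to caches than to
long-lived archives.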

-- 
Marc-Andre Lemburg
eGenix.com
