Fastest way to store ints and floats on disk

Emile van Sebille emile at fenx.com
Thu Aug 7 18:43:50 EDT 2008


Laszlo Nagy wrote:
> 
>  Hi,
> 
> I'm working on a pivot table. 

Hmm... I wrote a browser-based analysis tool and used the working name 
pyvot...

I found Numeric to provide the best balance of memory footprint and 
speed.  I also segregated data prep into a separate process to avoid 
excessive memory use at run time.
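
The prep step just scans the detail records, builds the recap columns as 
Numeric arrays and dumps them to disk as raw bytes, roughly like this 
(a sketch only -- the file names and values here are made up):

    import Numeric

    # one entry per recapped item/customer/month record
    qty = Numeric.array([12.0, 3.0, 7.5], Numeric.Float)

    # raw dump; the query process reloads it with Numeric.fromstring
    open('qty.bin', 'wb').write(qty.tostring())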

For the site I'm at, I've got 10 years sales history recapped from 
4327846 detail records into 458197 item by customer by month records, and 
top shows a 240MB memory footprint.  I've got 21 cross-indexed selection 
fields, and can display up to six data types (qty, price, sqft, cost, 
gp%, avg).  At another site I've got approx 8.5M records recapped into 
1M records with 15 indexes and 5 years monthly history living in a 540MB 
memory footprint.

It's reasonably quick: a query like 'select San Mateo, Foster City and 
San Carlos accounts, sort by customer and product category and display 
this year's sales by month' selects 260 records and renders in the 
browser in about 2 seconds.  Or, on the larger installation, 'Show sales 
for the past five years for product group 12 sorted by city within 
route' selects 160 records and renders in about 3 seconds.

My objective was to keep the info in memory for fast response times.

I played a lot of games getting this all to work well, including some C 
extensions, but Numeric's take, sum, tostring and fromstring ended up 
with 'pivotal' roles.  :)
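
Roughly, the query side looks like this (a minimal sketch, not the real 
code -- the file name, typecode and row indexes are illustrative):

    import Numeric

    # reload a measure column dumped by the prep process
    price = Numeric.fromstring(open('price.bin', 'rb').read(), Numeric.Float)

    # 'selected' holds the row indexes that passed the dimension filters
    selected = Numeric.array([0, 5, 17, 42])
    total = Numeric.sum(Numeric.take(price, selected))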

Regards,

Emile


> I would like to write it in Python. I 
> know I should be doing that in C, but I would like to create a 
> cross-platform version which can deal with smaller databases (not more 
> than a million facts).
> 
> The data is first imported from a CSV file: the user selects which 
> columns contain dimension and measure data (and which columns to 
> ignore). In the next step I would like to build up a database that is 
> efficient enough to be used for making pivot tables. Here is my idea for 
> the database:
> 
> Original CSV file with column header and values:
> 
> "Color","Year","Make","Price","VMax"
> Yellow,2000,Ferrari,100000,254
> Blue,2003,Volvo,50000,210
> 
> Using the GUI, it is converted to this:
> 
> dimensions = [
>    { 'name':'Color', 'colindex':0, 'values':[ 'Red', 'Blue', 'Green', 
> 'Yellow' ], },
>    { 'name':'Year', 'colindex':1, 'values':[ 
> 1995,1999,2000,2001,2002,2003,2007 ], },
>    { 'name':'Make', 'colindex':2, 'values':[ 'Ferrari', 'Volvo', 'Ford', 
> 'Lamborghini' ], },
> ]
> measures = [
>    { 'name':'Price', 'colindex':3 },
>    { 'name':'Vmax', 'colindex':4 },
> ]
> facts = [
>    ( (3,2,0),(100000.0,254.0)  ), # ( dimension_value_indexes, 
> measure_values )
>    ( (1,5,1),(50000.0,210.0) ),
>   .... # Some million rows or less
> ]
> 
> 
> The core of the idea is that, when using a relatively small number of 
> possible values for each dimension, the facts table becomes 
> significantly smaller and easier to process. (Processing the facts would 
> be: iterate over facts, filter out some of them, create statistical 
> values of the measures, grouped by dimensions.)
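
With that layout the aggregation is just a dict keyed by whichever 
dimension indexes you group on -- something like this sketch against the 
structures above (the filter is only an example):

    from collections import defaultdict

    facts = [
        ((3, 2, 0), (100000.0, 254.0)),
        ((1, 5, 1), (50000.0, 210.0)),
    ]

    # sum Price (measure 0) and count facts, grouped by Make (dimension 2),
    # keeping only rows where Color (dimension 0) is index 3 ('Yellow')
    totals = defaultdict(lambda: [0.0, 0])
    for dims, measures in facts:
        if dims[0] == 3:
            totals[dims[2]][0] += measures[0]
            totals[dims[2]][1] += 1
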
> 
> The facts table cannot be kept in memory because it is too big. I need 
> to store it on disk, be able to read incrementally, and make statistics. 
> In most cases, the "statistic" will be a simple sum of the measures and 
> a count of the facts affected. To be effective, reading the 
> facts from disk should not involve complex conversions. For this reason, 
> storing in CSV or XML or any textual format would be bad. I'm thinking 
> about a binary format, but how can I interface that with Python?
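
A plain binary file of fixed-width records is easy to interface from the 
struct module; precompiling the format with struct.Struct means the 
format string is parsed only once.  A sketch (record layout invented to 
match the example above):

    import struct

    rec = struct.Struct('<3i2d')   # three dimension indexes, two measure doubles

    out = open('facts.bin', 'wb')
    out.write(rec.pack(3, 2, 0, 100000.0, 254.0))
    out.write(rec.pack(1, 5, 1, 50000.0, 210.0))
    out.close()

    inp = open('facts.bin', 'rb')
    while True:
        chunk = inp.read(rec.size)
        if not chunk:
            break
        d0, d1, d2, price, vmax = rec.unpack(chunk)
    inp.close()
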
> 
> I already looked at:
> 
> - xdrlib, which throws me DeprecationWarning when I store some integers
> - struct, which uses a format string for each read operation; I'm 
> concerned about its speed
> 
> What else can I use?
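
If the columns are stored separately, the stdlib array module (or 
Numeric.fromstring, as above) pulls a whole column back in one call, with 
no per-record format parsing at all -- a sketch, again with made-up file 
names:

    import array

    prices = array.array('d')        # one double per fact
    f = open('prices.bin', 'rb')
    prices.fromstring(f.read())      # bulk conversion of the raw bytes
    f.close()

    total = sum(prices)
    count = len(prices)
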
> 
> Thanks,
> 
>   Laszlo
> 
> 
> 



