OLAP and pivot tables

Fri May 26 16:33:20 EDT 2006

Ben Stroud wrote:
> George Sakkis wrote:
> 
>> After a brief search, I didn't find any python package related to OLAP
>> and pivot tables. Did I miss anything ? To be more precise, I'm not so
>> interested in a full-blown OLAP server with an RDBMS backend, but
>> rather a pythonic API for constructing datacubes in memory, slicing and
>> dicing them, drilling down or up dimensions and exposing them in some
>> suitable form to a presentation layer. I've hacked a first cut of a
>> pivot table implementation and an XHTML generator that produces
>> hierarchical html tables but it's not particularly general or easily
>> extensible so far. Is there any interest at all in a pythonic version
>> of something like JOLAP or XMLA ?
>>
> I'd be interested as well.  I posted a similar question to the ruby 
> mailing list a few months ago to no avail.  Ideally, someone much more 
> talented than myself would create an open OLAP library in C that could be 
> interfaced with dynamic languages easily (I ordered some OLAP books and 
> started in on this, and decided I was in over my head for now).  As far 
> as free software, all I've been able to find is java-based Mondrian.  
> Maybe it could serve as a reference implementation for someone.

The NetEpi Analysis project (see http://sourceforge.net/projects/netepi),
although not strictly an OLAP or datacube engine, might offer some of
the things you are looking for. It is intended for exploratory
epidemiological analysis of (potentially large) health-related datasets,
but should work with most types of data for which an OLAP engine would
be useful.

Underneath there is a vertically-disaggregated, ordinally-mapped,
set-theoretic data selection and summarisation engine, which is a
pompous way of saying that it holds data column-wise in memory-mapped
Numpy (Numeric Python) arrays, and uses some fast (custom-written) set
functions on inverted indexes on the ordinal positions of column values
to select and summarise data. This is done entirely at run-time, cf.
most OLAP engines, which rely on a degree of pre-summarisation along
pre-chosen dimensions.

It is all Python and thus has a Python(ic) API, including an SQL-like
WHERE clause parser for data selection (OK, SQL is not Pythonic, but
that's just for data subsetting). It includes quite a few statistical
functions and nice graphics courtesy of R (http://www.r-project.org),
which is embedded via RPy (http://rpy.sourceforge.net/). Full support
for missing values and weighted datasets is provided (but not full
support for survey data with complex sample designs - that's
forthcoming).

Currently it works well with datasets in the 5-10 million row range, but
the basic design lends itself easily to parallelisation if you have
bigger datasets, and preliminary work indicates good speed improvements,
something we want to pursue given all these multi-core CPUs which are
now available at reasonable cost.

Be warned that NetEpi Analysis is currently only of beta quality, and is
a bit of a pig to install; at present it runs on Linux/Unix/Mac OS X
only. We hope to have a production-ready Version 1.0 by the end of 2006,
possibly with MS-Windows support as well. However, the core data
summarisation/subsetting engine is thought to be sound (and there are
some unit tests to attest to that).
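
To give a rough feel for the column-wise, inverted-index approach
described above, here is a minimal sketch using present-day NumPy. It is
not NetEpi Analysis code and does not use its API: the toy dataset and
the names build_inverted_index, select and crosstab are all hypothetical,
plain in-memory arrays stand in for memory-mapped ones, and NumPy's
generic intersect1d stands in for the custom-written set functions.

```python
import numpy as np

# Hypothetical toy dataset, held column-wise (one array per column).
columns = {
    "sex":    np.array(["F", "M", "F", "F", "M", "M"]),
    "region": np.array(["N", "N", "S", "S", "S", "N"]),
    "age":    np.array([34, 51, 29, 62, 45, 38]),
}

def build_inverted_index(column):
    """Map each distinct value to a sorted array of the row positions holding it."""
    order = np.argsort(column, kind="stable")   # row positions, grouped by value
    values, starts = np.unique(column[order], return_index=True)
    groups = np.split(order, starts[1:])        # one array of positions per value
    return {v.item(): g for v, g in zip(values, groups)}

# Index only the categorical columns used as dimensions.
index = {name: build_inverted_index(columns[name]) for name in ("sex", "region")}

def select(index, **conditions):
    """AND together equality conditions by intersecting row-position sets."""
    rows = None
    for col, value in conditions.items():
        hits = index[col].get(value, np.array([], dtype=np.intp))
        rows = hits if rows is None else np.intersect1d(rows, hits, assume_unique=True)
    return rows

def crosstab(index, row_dim, col_dim, measure=None):
    """Summarise at run time: count rows, or average `measure`, for each cell."""
    table = {}
    for r, r_rows in index[row_dim].items():
        for c, c_rows in index[col_dim].items():
            cell = np.intersect1d(r_rows, c_rows, assume_unique=True)
            if measure is None:
                table[(r, c)] = len(cell)
            else:
                table[(r, c)] = float(measure[cell].mean()) if len(cell) else None
    return table

print(select(index, sex="F", region="S"))              # rows where both conditions hold
print(crosstab(index, "sex", "region"))                # cell counts
print(crosstab(index, "sex", "region", columns["age"]))  # mean age per cell
```

Nothing is pre-aggregated here: every cell of the cross-tabulation is
computed on demand from the per-value position sets, which is the
contrast with pre-summarised OLAP cubes drawn above.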

Probably not quite what you were after but I thought it worth a mention.
Please post follow-ups, if any, to the NetEpi mailing list:
http://sourceforge.net/mail/?group_id=123700

Tim C

> 
> Cheers,
> Ben