[Numpy-discussion] Anyone written an SQL-like interface to numpy/PyTables?

Mon Jun 18 11:19:33 EDT 2007

I've often thought it would be interesting if someone would build a custom
table adapter to use PyTables in SQLlite.  Ie, essentially bolting a SQL
parser and query engine on top of PyTables.  Unfortunately, I don't have
time to do this, though hopefully someone will at some point.

-Kevin

On 6/18/07, Stephen Simmons <mail at stevesimmons.com> wrote:
>
> Hi,
>
> Has anyone written a parser for SQL-like queries against PyTables HDF
> tables or numpy recarrays?
>
> I'm asking because I have written code for grouping then summing rows of
> source data, where the groups are defined by functions of the source
> data, or looking up a related field in a separate lookup tables. I use
> this for tracking the performance of customer segments by status, with
> 48 monthly files of product usage/customer status data on 5m customers.
> Each of these is a 5m row HDF file, with several other 5m row HDF files
> that are used to work out which segment a customer belongs to.
>
> This grouping and summing is equivalent to something like the following
> SQL code:
>     SELECT
>         grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*),
> grp_fn4(table3.*),
>         count(table1.*),
>         sum(table1.field1), sum(table1.field2), ..., sum(table1.fieldK)
>     FROM table1 PARTITION(date)
>     LEFT JOIN table2 ON table1.field0=table2.field0
>     LEFT JOIN table3 ON table2.field0=table3.field0
>     WHERE min_date<=date<=max_date
>     GROUP BY grp_fn1(table1.*), grp_fn2(table1.*), grp_fn3(table2.*),
> grp_fn4(table3.*)
>
> I'm using numpy.bincount() function to do the grouping/summing,
> numpy.searchsorted() for fast lookup tables implementing the grouping
> functions grp_fn(), and some other C functions for a fast "zip" join of
> related tables whose primary keys are in the same order as the monthly
> date partitions.
>
> The Python code that specifies the grouping/summing fields looks like
> this:
>    agg = HDFAggregator('table1.hdf')
>    agg.add_group_function('MONTH', ...<min_date, max_date> )
>    agg.add_group_function('SEGMENT', ... <lookup on table 2>  )
>    agg.add_group_function('STATUS', ... <lookup on table 3> )
>    agg.do_aggregation(groupby='MONTH SEGMENT STATUS', count='CUST_NO',
> sum='<list of fields to sum>)
>    agg.add_calculated_field('PROFIT', 'VOLUME*(PRICE* (1-DISCOUNT)-COGS)
> - COSTOFSALES')
>    agg.save('output.hdf')
>
> On my laptop, this zips over my data at a speed of 400k rows/sec,
> aggregating it into 230,000 groups (48 months x 120 customer
> segments/subsegments x 5 product groups x 8 statuses) with subtotals for
> 30 data fields in each group. This is essentially as fast as PyTables
> can read in the HDF files from disk; peak speeds with fewer groups (e.g.
> 48x5x1x4) are above 1Mrows/sec if the HDF files are already in the disk
> cache.
>
> One option I am considering now is bolting an SQL-like parser on the
> front to provide a more natural interface for those unfortunate people
> who prefer SQL to Python. I don't want to write an SQL parser from
> scratch, so it would be great to know if there are any existing projects
> to put an SQL-like interface on numpy or PyTables (other than numexpr).
>
> So has anyone looked at using an SQL-like syntax for querying
> numpy/PyTables data?
>
> Cheers
>
> Stephen
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20070618/e695b320/attachment.html>