[Numpy-discussion] How to implement a 'pivot table?'

Geoffrey Zhu zyzhu2000 at gmail.com
Tue Jul 31 11:00:18 EDT 2007


Hi Timothy,

On 7/31/07, Timothy Hochberg <tim.hochberg at ieee.org> wrote:

> [SNIP]
>
> > The 'brute-force' way is basically what you suggested -- looping
> > through all the records and building a two-way hash-table of the data.
> >
> > The problem of the brute-force' approach is that it is not taking
> > advantage of facilities of numpy and can be slow in speed. If only
> > there is some built-in mechanism in numpy to handle this.
>
> I'm curious; have you tried this and found it slow, or is this a hunch based
> on the reputed slowness of Python? Algorithmically, you can't do much
> better: the dictionary and set operations are O(1), so the whole operation
> is O(N), and you won't do any better than that, order wise. What your left
> with is trying to reduce constant factors.

I agree that algorithmically you can't do much better. It is basically
a C vs Python thing. One advantage of numpy is that you can do
vectorized operations at the speed of C. With this algorithm, we have
to process the data element by element and the speed advantage of
numpy is lost. Since data has to be stored in python sets and maps, I
imagine the storage advantage is also lost.

I was in fact looking for some implemention of this algorithm in numpy
(and so C) that does exactly this, or some implementation of this
algorithm that can leverage the fast numpy routines to do this.

I haven't tried it with the real data load yet. I know the number of
records will be huge and it is just a hunch that it will be slow.

> There are various ways one might go about reducing constant factors, but
> they depend on the details of the problem. For example, if the dates are
> dense and you are going to parse them anyway, you could replace the hash
> with table that you index into with the date as an integer. I'm not sure
> that you are going to do a lot better than the brute force algorithm in the
> generic force case though.

Unfortunately it has to be something generic.

Thanks a lot for your help.
Geoffrey



More information about the NumPy-Discussion mailing list