[Numpy-discussion] Histograms of extremely large data sets

Cameron Walsh cameron.walsh at gmail.com
Thu Dec 14 22:43:46 EST 2006


Using Eric's latest speed-testing script, here are the results, now
including David's version:

cameron at cameron-laptop:~/code_snippets/histogram$ python histogram_speed.py
type: uint8
millions of elements: 100.0
sec (C indexing based): 8.44 100000000
sec (numpy iteration based): 8.91 100000000
sec (rick's pure python): 6.4 100000000
sec (nd evenly spaced): 2.1 100000000
sec (1d evenly spaced): 1.33 100000000
sec (david huard): 35.84 100000000


Summary:
                case    sec     speed-up
  weave_1d_arbitrary    8.440000        0.758294
  weave_nd_arbitrary    8.910000        0.718294
     ricks_arbitrary    6.400000        1.000000
       weave_nd_even    2.100000        3.047619
       weave_1d_even    1.330000        4.812030
         david_huard    35.840000       0.178571
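
For anyone wondering why the evenly-spaced variants win: with uniform
bins the bin index of each element is a single arithmetic expression,
so no per-element search is needed.  Here is a minimal pure-numpy
sketch of that idea (my own illustration of the technique, not Eric's
weave code):

import numpy as np

def hist_even(data, nbins, lo, hi):
    # With evenly spaced bins the bin index is plain arithmetic:
    # floor((x - lo) * nbins / (hi - lo)); no per-element search.
    scale = nbins / float(hi - lo)
    idx = ((data.astype(np.float64) - lo) * scale).astype(np.intp)
    np.clip(idx, 0, nbins - 1, out=idx)   # put x == hi in the last bin
    counts = np.bincount(idx)
    if counts.size < nbins:               # pad if the top bins are empty
        counts = np.concatenate(
            [counts, np.zeros(nbins - counts.size, dtype=counts.dtype)])
    return counts

# e.g. hist_even(data, 256, 0, 256) for the uint8 data above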

I also tried this on an equal-sized sample of my real-world data: 100
image slices, 8 bits per sample, 1000x1000 pixels per image.  The full
data set is 489 image slices, but I was unable to randomly generate
489 million data samples because I ran out of memory and started
thrashing the page file, which would have ruined any timing results.
So I compared like with like and got the following results with
real-world data:

type: uint8
millions of elements: 100.0
sec (C indexing based): 6.1 100000000
sec (numpy iteration based): 7.07 100000000
sec (rick's pure python): 4.77 100000000
sec (nd evenly spaced): 2.12 100000000
sec (1d evenly spaced): 1.33 100000000
sec (david huard): 16.47 100000000


Summary:
                case    sec     speed-up
  weave_1d_arbitrary    6.100000        0.781967
  weave_nd_arbitrary    7.070000        0.674682
     ricks_arbitrary    4.770000        1.000000
       weave_nd_even    2.120000        2.250000
       weave_1d_even    1.330000        3.586466
         david_huard    16.470000       0.289617

Note how much faster some of the algorithms run on the non-random,
real-world data.  I assume this is because the quick-sort's running
time varies with the initial ordering of the data, so the sort-based
approaches benefit from whatever structure the real-world data has.
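
Roughly, the sort-based approach works like this (a minimal sketch of
the general technique, not David's exact code):

import numpy as np

def hist_sort(data, edges):
    # Sort once, then binary-search each bin edge in the sorted data.
    # Consecutive edge positions differ by the count for that bin.
    # The sort dominates the runtime, which is why the initial order
    # of the data shows up in the timings above.
    s = np.sort(data)
    pos = np.searchsorted(s, edges)
    return pos[1:] - pos[:-1]

# e.g. hist_sort(data, np.arange(257)) for uint8 data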

Scaling with the full data set was similar.  Unfortunately, David's
code was not able to load all 489 image slices; it threw the same
error as the one mentioned in the first email in this thread.
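
One way around the memory limit might be to accumulate the histogram
one slice at a time, so the whole volume never needs to be in memory
at once; something like this (an untested sketch on my part):

import numpy as np

def hist_by_slice(slice_iter, nbins=256):
    # slice_iter yields one 2-D uint8 image slice at a time (e.g.
    # read from disk), so peak memory is a single slice plus the
    # counts array rather than the whole 489-slice volume.
    counts = np.zeros(nbins, dtype=np.int64)
    for img in slice_iter:
        c = np.bincount(img.ravel())
        counts[:c.size] += c
    return counts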

Later parts of the project I am working on will probably require
iteration over the entire data set, and per-element iteration seems
to be what slows down several of these histogram algorithms, which is
why the sort() approach is needed.  I'll have a look at the iterator
and see if there's anything that can be done there instead.  I'm
hoping it will be possible to use a C-based iterator for a numpy
multiarray, as this would allow many data processing algorithms to
run faster, not just the histogram.
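
To show how much the per-element iteration costs, here is a quick
timing sketch (my own illustration, separate from the benchmarks
above):

import numpy as np
import time

data = np.random.randint(0, 256, 10**6).astype(np.uint8)

# Per-element iteration: every element crosses the C/Python boundary,
# and that overhead dominates the runtime.
t0 = time.time()
counts = [0] * 256
for x in data:
    counts[x] += 1
print("python loop: %.2f s" % (time.time() - t0))

# Vectorized counting stays inside numpy's C loops throughout.
t0 = time.time()
counts = np.bincount(data)
print("bincount:    %.2f s" % (time.time() - t0))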

Once again, thanks to everyone for all your input.  This seems to have
generated more discussion and action than I anticipated, for which I
am very grateful.

Best regards,

Cameron.


