[Numpy-discussion] [Newbie] Fast plotting

Franck Pommereau pommereau at univ-paris12.fr
Wed Jan 7 07:37:53 EST 2009


Hi all,

First, let me say that I'm impressed: this mailing list is probably the
most responsive I've ever seen. I asked my first question and immediately
got more solutions than I had time to test... Many thanks to everyone
who answered.

Using the various proposals, I ran two performance tests:
 - test 1: 2000000 random values
 - test 2: 1328724 values from my real use case
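
For reference, each function below was timed with something like the
following (the data generation for test 1 is only illustrative: a
limited integer range ensures that x contains duplicates, so the
grouping has something to aggregate):

import time
import numpy

x = numpy.random.randint(0, 1000, 2000000).astype(numpy.float64)
y = numpy.random.random(2000000)

start = time.clock()
f0(x, y)   # likewise for f1, ..., f4
print "CPU time: %.2fs" % (time.clock() - start)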

Here are the various functions and how they perform:

from collections import defaultdict
from itertools import izip

import numpy
import matplotlib.mlab

def f0 (x, y) :
    """Initial version

    test 1 CPU times: 13.37s
    test 2 CPU times: 5.92s
    """
    s, n = {}, {}
    for a, b in zip(x, y) :
        s[a] = s.get(a, 0.0) + b
        n[a] = n.get(a, 0) + 1
    return (numpy.array([a for a in sorted(s)]),
            numpy.array([s[a]/n[a] for a in sorted(s)]))

def f1 (x, y) :
    """Alan G Isaac <aisaac at american.edu>
    Modified in order to sort the result only once.

    test 1 CPU times: 10.86s
    test 2 CPU times: 2.78s

    defaultdict indeed speeds things up; avoiding one of the two sorts
    probably helps as well.
    """
    s, n = defaultdict(float), defaultdict(int)
    for a, b in izip(x, y) :
        s[a] += b
        n[a] += 1
    new_x = numpy.array([a for a in sorted(s)])
    return (new_x,
            numpy.array([s[a]/n[a] for a in new_x]))

def f2 (x, y) :
    """Francesc Alted <faltet at pytables.org>
    Modified to preallocate the result array (it appeared faster).

    test 1: killed after more than 10 minutes
    test 2 CPU times: 22.01s

    This result is not surprising, as I guess the complexity is
    quadratic: one pass for each unique value in x, and presumably one
    nested pass to compute y[x == i].
    """
    u = numpy.unique(x)
    m = numpy.empty(len(u))  # float result; an int array here would truncate the means
    for pos, i in enumerate(u) :
        g = y[x == i]
        m[pos] = g.mean()
    return u, m
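
Out of curiosity, the quadratic loop above could probably be avoided
altogether with bincount. The sketch below is untested (f2b is just a
working name) and assumes a numpy where unique supports return_inverse
(older releases spell this numpy.unique1d):

def f2b (x, y) :
    """Untested sketch: one bincount for the per-group sums of y, one
    for the group sizes, no Python-level loop."""
    u, inv = numpy.unique(x, return_inverse=True)
    sums = numpy.bincount(inv, weights=y)    # per-group sums of y
    counts = numpy.bincount(inv)             # per-group sizes
    return u, sums / counts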

def f3 (x, y) :
    """Sebastian Stephan Berg <sebastian at sipsolutions.net>
    Modified because I can always work in place.

    test 1 CPU times: 17.43s
    test 2 CPU times: 0.21s

    Adopted! This is definitely the fastest one on real values.
    I tried to preallocate the arrays by setting u=numpy.unique(x) and
    then looping on u, but the result was slower, probably because of
    unique().

    Compared with f1, it's slower on large arrays of random values. This
    may be explained by a complexity argument: f1 has linear complexity
    (two passes in sequence) while f3 is probably O(N log N) (one sort,
    two passes to set x[:] and y[:], and one loop over each distinct
    value with a nested searchsorted, which is logarithmic). But real
    values are far from random, so the sort is probably more efficient,
    and the while loop is shorter because there are fewer distinct
    values.
    """
    s = x.argsort()
    x[:] = x[s]
    y[:] = y[s]
    u, means, start, value = [], [], 0, x[0]
    while True:
        stop = x.searchsorted(value, side='right')
        u.append(value)
        means.append(y[start:stop].mean())
        if stop == len(x):
            break
        value = x[stop]
        start = stop
    return numpy.array(u), numpy.array(means)

def f4 (x, y) :
    """Jean-Baptiste Rudant <boogaloojb at yahoo.fr>

    test 1 CPU times: 111.21s
    test 2 CPU times: 13.48s

    As Jean-Baptiste noticed, this solution is not very efficient (but
    it works almost off-the-shelf).
    """
    recXY = numpy.rec.fromarrays((x, y), names='x, y')
    return matplotlib.mlab.rec_groupby(recXY, ('x',),
                                       (('y', numpy.mean, 'y_avg'),))

A few more remarks.

Sebastian Stephan Berg wrote:
> Just thinking. If the parameters are limited, you may be able to use the
> histogram feature? Doing one histogram with Y as weights, then one
> without weights and calculating the mean from this yourself should be
> pretty speedy I imagine.

I'm afraid I don't know what the histogram function computes. But this
may be worth investigating, because I think I'll need it later on in
order to smooth my graphs (by plotting mean values over intervals).
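
If I understand the suggestion, it would look something like the sketch
below (untested; the number of bins and the handling of empty bins are
my own guesses):

def histogram_means (x, y, bins=50) :
    """Untested sketch of the histogram idea: sum the y values falling
    into each bin of x, count the x values per bin, and divide."""
    sums, edges = numpy.histogram(x, bins=bins, weights=y)
    counts, _ = numpy.histogram(x, bins=edges)
    centers = (edges[:-1] + edges[1:]) / 2.0
    nonempty = counts > 0   # avoid dividing by zero in empty bins
    return centers[nonempty], sums[nonempty] / counts[nonempty]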

Bruce Southey wrote:
> If you use Knuth's one pass approach
> (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#III._On-line_algorithm)
> you can write a function to get the min, max, mean and variance/standard
> deviation in a single pass through the array rather than one pass for
> each. I do not know if this will provide any advantage as that will
> probably depend on the size of the arrays.

If I understood correctly, this algorithm computes the variance of a
whole array. I can see how to adapt it to compute the mean (already done
by the algorithm), the max, the min, etc., but I do not see how to adapt
it to my case (one mean per distinct x value).
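
Thinking about it a bit more, maybe the one-pass update could be applied
per distinct x value, along the lines of f1. An untested sketch (f5 is
just a working name; the variance returned is the population variance
M2/n):

def f5 (x, y) :
    """Untested sketch: Welford's one-pass update, applied per distinct
    value of x, giving count, mean and variance in a single pass."""
    stats = {}   # maps a value of x to (n, mean, M2)
    for a, b in izip(x, y) :
        n, mean, M2 = stats.get(a, (0, 0.0, 0.0))
        n += 1
        delta = b - mean
        mean += delta / n
        M2 += delta * (b - mean)
        stats[a] = (n, mean, M2)
    keys = sorted(stats)
    means = numpy.array([stats[a][1] for a in keys])
    variances = numpy.array([stats[a][2] / stats[a][0] for a in keys])
    return numpy.array(keys), means, variances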

> Also, please use the highest precision possible (ie float128) for your
> arrays to minimize numerical error due to the size of your arrays.

Thanks for the advice!
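
(I note in passing that float128 availability depends on the platform;
where it exists, the conversion would just be something like:)

x = numpy.asarray(x, dtype=numpy.longdouble)   # float128 where supported
y = numpy.asarray(y, dtype=numpy.longdouble)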

So, thank you again everybody.
Cheers,
Franck


