[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Charles R Harris charlesr.harris at gmail.com
Sat Mar 22 20:34:29 EDT 2008


On Sat, Mar 22, 2008 at 5:32 PM, Charles R Harris <charlesr.harris at gmail.com>
wrote:

>
>
> On Sat, Mar 22, 2008 at 5:03 PM, James Philbin <philbinj at gmail.com> wrote:
>
> > OK, I've written a simple benchmark which implements an elementwise
> > multiply (A=B*C) in three different ways (standard C, intrinsics, and
> > hand-coded assembly). On the face of it, the results seem to indicate
> > that the vectorization works best on medium-sized inputs. If people
> > could post the results of running the benchmark on their machines
> > (takes ~1 min), along with the output of gcc --version and their chip
> > model, that would be very useful.
> >
> > It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench
> >
> > Here are two:
> >
> > CPU: Core Duo T2500 @ 2GHz
> > gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
> >        Problem size              Simple              Intrin              Inline
> >                 100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms ( 50.6%)
> >                1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms ( 50.6%)
> >               10000   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms ( 75.4%)
> >              100000   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms ( 56.4%)
> >             1000000   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms ( 96.7%)
> >            10000000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms ( 94.6%)
> >
> > CPU: Intel Xeon E5345 @ 2.33GHz
> > gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> >        Problem size              Simple              Intrin              Inline
> >                 100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms ( 77.4%)
> >                1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms ( 86.6%)
> >               10000   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms ( 79.6%)
> >              100000   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms ( 77.1%)
> >             1000000   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms ( 73.6%)
> >            10000000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms ( 94.9%)
> >
>
> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> cpu:  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
>
>         Problem size              Simple              Intrin              Inline
>                  100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
>                 1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
>                10000   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
>               100000   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
>              1000000   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
>             10000000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)
>
> It looks like memory access is the bottleneck; otherwise, running 4
> floats through in parallel should go a lot faster. I need to modify
> the program a bit and see how it works for doubles.
>
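
For concreteness, the inner loops being compared are roughly of the
following shape. This is a minimal sketch, not the actual vec_bench.c;
the names mul_simple and mul_intrin are made up here, and the hand-coded
assembly variant is omitted. Compile with gcc -msse -O2.

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Plain C version: one multiply per iteration. */
    void mul_simple(float *a, const float *b, const float *c, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] * c[i];
    }

    /* SSE version: four floats per 128-bit register. Unaligned
     * loads/stores keep the sketch simple; n is assumed to be a
     * multiple of 4. */
    void mul_intrin(float *a, const float *b, const float *c, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            __m128 vb = _mm_loadu_ps(b + i);
            __m128 vc = _mm_loadu_ps(c + i);
            _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
        }
    }

Once the arrays no longer fit in cache (the million-element rows and
beyond), both versions spend their time waiting on memory and the SIMD
advantage largely disappears, which matches the numbers posted above.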

Doubles don't look so good running on a 32-bit OS. Maybe alignment would
help (see the sketch after the table below).
Compiled with gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl


        Problem size              Simple              Intrin
                 100   0.0002ms (100.0%)   0.0002ms (149.5%)
                1000   0.0015ms (100.0%)   0.0024ms (159.0%)
               10000   0.0219ms (100.0%)   0.0180ms ( 81.9%)
              100000   0.1518ms (100.0%)   0.1686ms (111.1%)
             1000000   5.5588ms (100.0%)   5.8145ms (104.6%)
            10000000  56.7152ms (100.0%)  59.3139ms (104.6%)
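
One way to test the alignment idea: allocate 16-byte-aligned buffers and
use the aligned SSE2 load/store forms. A hedged sketch, not the real
vec_bench_dbl.c; mul_intrin_dbl and alloc_dbl are hypothetical names
here. Note that SSE2 fits only two doubles per 128-bit register, so the
best case is a 2x speedup, half of what single precision can hope for.

    #include <emmintrin.h>  /* SSE2 intrinsics; gcc's SSE headers also
                               declare _mm_malloc/_mm_free */

    /* SSE2 version: two doubles per iteration. The aligned forms
     * _mm_load_pd/_mm_store_pd require a, b, and c to be 16-byte
     * aligned; n is assumed even. */
    void mul_intrin_dbl(double *a, const double *b, const double *c, int n)
    {
        int i;
        for (i = 0; i < n; i += 2) {
            __m128d vb = _mm_load_pd(b + i);
            __m128d vc = _mm_load_pd(c + i);
            _mm_store_pd(a + i, _mm_mul_pd(vb, vc));
        }
    }

    /* 16-byte-aligned allocation; release with _mm_free(). */
    double *alloc_dbl(int n)
    {
        return (double *)_mm_malloc(n * sizeof(double), 16);
    }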


Chuck