[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Charles R Harris charlesr.harris at gmail.com
Sat Mar 22 19:32:17 EDT 2008


On Sat, Mar 22, 2008 at 5:03 PM, James Philbin <philbinj at gmail.com> wrote:

> OK, I've written a simple benchmark which implements an elementwise
> multiply (A = B*C) in three different ways (standard C, intrinsics, hand
> coded assembly). On the face of things the results seem to indicate
> that the vectorization works best on medium-sized inputs. If people
> could post the results of running the benchmark on their machines
> (takes ~1 min) along with the output of gcc --version and their chip
> model, that would be very useful.
>
> It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench
>
> Here's two:
>
> CPU: Core Duo T2500 @ 2GHz
> gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
>        Problem size              Simple              Intrin              Inline
>                 100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms ( 50.6%)
>                1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms ( 50.6%)
>               10000   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms ( 75.4%)
>              100000   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms ( 56.4%)
>             1000000   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms ( 96.7%)
>            10000000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms ( 94.6%)
>
> CPU: Intel Xeon E5345 @ 2.33GHz
> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
>        Problem size              Simple              Intrin              Inline
>                 100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms ( 77.4%)
>                1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms ( 86.6%)
>               10000   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms ( 79.6%)
>              100000   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms ( 77.1%)
>             1000000   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms ( 73.6%)
>            10000000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms ( 94.9%)

gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
cpu:  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz

        Problem size              Simple              Intrin              Inline
                 100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
                1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
               10000   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
              100000   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
             1000000   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
            10000000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)

It looks like memory access is the bottleneck; otherwise, running 4 floats
through in parallel should go a lot faster. I need to modify the program a
bit and see how it works for doubles.

Chuck