[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Charles R Harris
charlesr.harris at gmail.com
Sat Mar 22 19:32:17 EDT 2008
On Sat, Mar 22, 2008 at 5:03 PM, James Philbin <philbinj at gmail.com> wrote:
> OK, I've written a simple benchmark which implements an elementwise
> multiply (A=B*C) in three different ways (standard C, intrinsics,
> hand-coded assembly). On the face of things the results seem to indicate
> that the vectorization works best on medium-sized inputs. If people
> could post the results of running the benchmark on their machines
> (takes ~1min) along with the output of gcc --version and their chip
> model, that would be very useful.
>
> It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench
>
> Here are two:
>
> CPU: Core Duo T2500 @ 2GHz
> gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
> Problem size  Simple              Intrin              Inline
>          100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms ( 50.6%)
>         1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms ( 50.6%)
>        10000   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms ( 75.4%)
>       100000   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms ( 56.4%)
>      1000000   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms ( 96.7%)
>     10000000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms ( 94.6%)
>
> CPU: Intel Xeon E5345 @ 2.33GHz
> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> Problem size  Simple              Intrin              Inline
>          100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms ( 77.4%)
>         1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms ( 86.6%)
>        10000   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms ( 79.6%)
>       100000   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms ( 77.1%)
>      1000000   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms ( 73.6%)
>     10000000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms ( 94.9%)
>
gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Problem size  Simple              Intrin              Inline
         100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
        1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
       10000   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
      100000   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
     1000000   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
    10000000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)
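It looks like memory access is the bottleneck; otherwise running 4 floats
through in parallel should go a lot faster. At the two largest problem
sizes the intrinsic and inline versions are no faster than plain C here.
I need to modify the program a bit and see how it works for doubles.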
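
For readers without vec_bench.c to hand, here is a minimal sketch of the
kind of kernels being compared, plus a helper for turning a timing into an
approximate memory bandwidth. This is not the benchmark source, which is not
shown in this message. The function names (mul_simple, mul_intrin,
effective_gbps), the use of unaligned loads, and the traffic model (two
arrays read, one written, 4 bytes per float) are all assumptions made for
illustration.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics; enabled by gcc -msse */
#endif

/* Plain C version: a[i] = b[i] * c[i]. */
void mul_simple(float *a, const float *b, const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Intrinsics version: multiplies 4 floats per instruction, then finishes
   any leftover elements with a scalar loop. Unaligned loads and stores are
   an assumption of this sketch so that any pointer is acceptable. If SSE
   is not enabled at compile time, only the scalar loop runs. */
void mul_intrin(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }
#endif
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Rough effective bandwidth in GB/s for n elements processed in ms
   milliseconds. Assumed traffic model: B and C are read and A is written,
   so 3 * sizeof(float) bytes per element. Cache effects such as
   write-allocate traffic are ignored. */
double effective_gbps(size_t n, double ms)
{
    double bytes = 3.0 * (double)sizeof(float) * (double)n;
    return bytes / (ms * 1e-3) / 1e9;
}
```

Under that traffic model, the Core 2 6600 figures above work out to roughly
4.3 GB/s for the plain C loop at 10000000 elements (28.1577ms) and roughly
11.1 GB/s at 100000 elements (0.1081ms). That drop once the arrays no longer
fit in cache is consistent with the loop being limited by memory rather than
by the multiply.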
Chuck