[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Charles R Harris
charlesr.harris at gmail.com
Sat Mar 22 20:34:29 EDT 2008
On Sat, Mar 22, 2008 at 5:32 PM, Charles R Harris <charlesr.harris at gmail.com>
wrote:
>
>
> On Sat, Mar 22, 2008 at 5:03 PM, James Philbin <philbinj at gmail.com> wrote:
>
> > OK, I've written a simple benchmark which implements an elementwise
> > multiply (A = B*C) in three different ways (standard C, intrinsics,
> > hand-coded assembly). On the face of things, the results seem to indicate
> > that the vectorization works best on medium-sized inputs. If people
> > could post the results of running the benchmark on their machines
> > (takes ~1 min), along with the output of gcc --version and their chip
> > model, that would be very useful.
> >
> > It should be compiled with: gcc -msse -O2 vec_bench.c -o vec_bench
> >
> > Here are two:
> >
> > CPU: Core Duo T2500 @ 2GHz
> > gcc --version: gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)
> > Problem size        Simple               Intrin               Inline
> >          100   0.0003ms (100.0%)   0.0002ms ( 67.7%)   0.0002ms ( 50.6%)
> >         1000   0.0030ms (100.0%)   0.0021ms ( 69.2%)   0.0015ms ( 50.6%)
> >        10000   0.0370ms (100.0%)   0.0267ms ( 72.0%)   0.0279ms ( 75.4%)
> >       100000   0.2258ms (100.0%)   0.1469ms ( 65.0%)   0.1273ms ( 56.4%)
> >      1000000   4.5690ms (100.0%)   4.4616ms ( 97.6%)   4.4185ms ( 96.7%)
> >     10000000  47.0022ms (100.0%)  45.4100ms ( 96.6%)  44.4437ms ( 94.6%)
> >
> > CPU: Intel Xeon E5345 @ 2.33GHz
> > gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> > Problem size        Simple               Intrin               Inline
> >          100   0.0001ms (100.0%)   0.0001ms ( 69.2%)   0.0001ms ( 77.4%)
> >         1000   0.0010ms (100.0%)   0.0008ms ( 78.1%)   0.0009ms ( 86.6%)
> >        10000   0.0108ms (100.0%)   0.0088ms ( 81.2%)   0.0086ms ( 79.6%)
> >       100000   0.1131ms (100.0%)   0.0897ms ( 79.3%)   0.0872ms ( 77.1%)
> >      1000000   5.2103ms (100.0%)   3.9153ms ( 75.1%)   3.8328ms ( 73.6%)
> >     10000000  54.1815ms (100.0%)  51.8286ms ( 95.7%)  51.4366ms ( 94.9%)
> >
>
> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
>
> Problem size        Simple               Intrin               Inline
>          100   0.0002ms (100.0%)   0.0001ms ( 68.7%)   0.0001ms ( 74.8%)
>         1000   0.0015ms (100.0%)   0.0011ms ( 72.0%)   0.0012ms ( 80.4%)
>        10000   0.0154ms (100.0%)   0.0111ms ( 72.1%)   0.0122ms ( 79.1%)
>       100000   0.1081ms (100.0%)   0.0759ms ( 70.2%)   0.0811ms ( 75.0%)
>      1000000   2.7778ms (100.0%)   2.8172ms (101.4%)   2.7929ms (100.5%)
>     10000000  28.1577ms (100.0%)  28.7332ms (102.0%)  28.4669ms (101.1%)
>
> It looks like memory access is the bottleneck; otherwise, running four
> floats through in parallel should go a lot faster. I need to modify the
> program a bit and see how it works for doubles.
>
Doubles don't look so good running on a 32-bit OS. Maybe alignment would
help.
Compiled with gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl
Problem size        Simple               Intrin
         100   0.0002ms (100.0%)   0.0002ms (149.5%)
        1000   0.0015ms (100.0%)   0.0024ms (159.0%)
       10000   0.0219ms (100.0%)   0.0180ms ( 81.9%)
      100000   0.1518ms (100.0%)   0.1686ms (111.1%)
     1000000   5.5588ms (100.0%)   5.8145ms (104.6%)
    10000000  56.7152ms (100.0%)  59.3139ms (104.6%)
Chuck