[pypy-commit] extradoc extradoc: describe the scimark benchmarks
hakanardo
noreply at buildbot.pypy.org
Wed Aug 15 19:44:05 CEST 2012
Author: Hakan Ardo <hakan at debian.org>
Branch: extradoc
Changeset: r4585:4887f7fc2e99
Date: 2012-08-15 19:43 +0200
http://bitbucket.org/pypy/extradoc/changeset/4887f7fc2e99/
Log: describe the scimark benchmarks
diff --git a/talk/dls2012/licm.pdf b/talk/dls2012/licm.pdf
index d0e3ca21bc58e605bbf333d46f6acdc18de2a29d..d44c6adbc9741258517f595c681611569b3e9240
GIT binary patch
[cut]
diff --git a/talk/dls2012/paper.tex b/talk/dls2012/paper.tex
--- a/talk/dls2012/paper.tex
+++ b/talk/dls2012/paper.tex
@@ -63,7 +63,7 @@
\newboolean{showcomments}
-\setboolean{showcomments}{true}
+\setboolean{showcomments}{false}
\ifthenelse{\boolean{showcomments}}
{\newcommand{\nb}[2]{
\fbox{\bfseries\sffamily\scriptsize#1}
@@ -931,8 +931,9 @@
we see improvements in several cases. The ideal loop for this optimization
is short and contains numerical calculations with no failing guards and no
external calls. Larger loops involving many operations on complex objects
-typically benefit less from it. Loop peeling never makes runtime performance worse, in
-the worst case the peeled loop is exactly the same as the preamble. Therefore we
+typically benefit less from it. Loop peeling never makes the generated code worse; in
+the worst case the peeled loop is exactly the same as the preamble.
+Therefore we
chose to present benchmarks of small numeric kernels where loop peeling can show
its use.
@@ -983,30 +984,30 @@
\subsection{Python}
The Python interpreter of the RPython framework is a complete Python
version 2.7 compatible interpreter. A set of numerical
-calculations were implemented in both Python and in C and their
+calculations were implemented in Python, C and Lua, and their
runtimes are compared in Figure~\ref{fig:benchmarks}. The benchmarks are
\begin{itemize}
-\item {\bf sqrt}: approximates the square root of $y$. The approximation is
+\item {\bf sqrt}$\left(T\right)$: approximates the square root of $y$. The approximation is
initiated to $x_0=y/2$ and the benchmark consists of a single loop updating this
approximation using $x_i = \left( x_{i-1} + y/x_{i-1} \right) / 2$ for $1\leq i < 10^8$.
Only the latest calculated value $x_i$ is kept alive as a local variable within the loop.
There are three different versions of this benchmark where $x_i$
- is represented with different type of objects: int's, float's and
+ is represented with different types of objects, $T$: ints, floats and
Fix16s. The latter, Fix16, is a custom class that implements
fixed-point arithmetic with 16 bits of precision. In Python there is only
a single implementation of the benchmark that gets specialized
depending on the class of its input argument, $y$, while in C,
there are three different implementations.
-\item {\bf conv3}: one-dimensional convolution with fixed kernel-size $3$. A single loop
+\item {\bf conv3}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $3$. A single loop
is used to calculate a vector ${\bf b} = \left(b_1, \cdots, b_n\right)$ from a vector
${\bf a} = \left(a_1, \cdots, a_n\right)$ and a kernel ${\bf k} = \left(k_1, k_2, k_3\right)$ using
$b_i = k_3 a_i + k_2 a_{i+1} + k_1 a_{i+2}$ for $1 \leq i \leq n$. Both the output vector, $\bf b$,
and the input vectors, $\bf a$ and $\bf k$, are allocated prior to running the benchmark. It is executed
with $n=10^5$ and $n=10^6$.
-\item {\bf conv5}: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with
+\item {\bf conv5}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with
${\bf k} = \left(k_1, k_2, k_3, k_4, k_5\right)$. The enumeration of the elements in $\bf k$ is still
hardcoded into the implementation, making the benchmark consist of a single loop too.
-\item {\bf conv3x3}: two-dimensional convolution with kernel of fixed
+\item {\bf conv3x3}$\left(n\right)$: two-dimensional convolution with kernel of fixed
size $3 \times 3$ using a custom class to represent two-dimensional
arrays. It is implemented as two nested loops that iterate over the elements of the
$n\times n$ output matrix ${\bf B} = \left(b_{i,j}\right)$ and calculates each element from the input matrix
@@ -1021,12 +1022,12 @@
\end{equation}
for $1 \leq i \leq n$ and $1 \leq j \leq n$.
The memory for storing the matrices is again allocated outside the benchmark and $n=1000$ was used.
-\item {\bf dilate3x3}: two-dimensional dilation with kernel of fixed
+\item {\bf dilate3x3}$\left(n\right)$: two-dimensional dilation with kernel of fixed
size $3 \times 3$. This is similar to convolution but instead of
summing over the terms in Equation~\ref{eq:convsum}, the maximum over those terms is taken. That places an
external call to a max function within the loop, which prevents some
of the optimizations.
-\item {\bf sobel}: a low-level video processing algorithm used to
+\item {\bf sobel}$\left(n\right)$: a low-level video processing algorithm used to
locate edges in an image. It calculates the gradient magnitude
using Sobel derivatives. A Sobel x-derivative, $D_x$, of an $n \times n$ image, ${I}$, is formed
by convolving ${I}$ with
@@ -1050,11 +1051,31 @@
on top of a custom two-dimensional array class.
It is
a straightforward implementation providing two-dimensional
-indexing with out of bounds checks. For the C implementations it is
+indexing with out-of-bounds checks and
+data stored in row-major order.
+For the C implementations it is
implemented as a C++ class. The other benchmarks are implemented in
plain C. All the benchmarks except sqrt operate on C double-precision floating
point numbers, both in the Python and the C code.
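To make the kind of kernel discussed above concrete, the sqrt benchmark's single Python implementation might be sketched as follows. This is a minimal illustration based only on the description in the text; the function name and the reduced iteration count are assumptions, not the paper's actual sources.

```python
# Sketch of the sqrt benchmark described above: one Python function that
# the JIT specializes on the class of its argument y (int, float, or a
# Fix16 fixed-point instance). Names are illustrative.
def sqrt_approx(y, iterations=10**8):
    x = y / 2  # x_0 = y / 2
    for _ in range(1, iterations):
        x = (x + y / x) / 2  # x_i = (x_{i-1} + y / x_{i-1}) / 2
    return x
```

As in the benchmark description, only the latest approximation is kept alive across iterations; calling the same function with an int, a float, or a Fix16 instance exercises the three specializations that require three separate implementations in C.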
+In addition we ported the
+SciMark\footnote{\texttt{http://math.nist.gov/scimark2/}} benchmarks to Python, and compared
+their runtimes with the already existing Lua and C implementations.
+This port was performed after the release of the PyPy version used to run the benchmarks, which means that
+these benchmarks have not influenced the PyPy implementation.
+SciMark consists of
+
+\begin{itemize}
+\item {\bf SOR}$\left(n, c\right)$: Jacobi successive over-relaxation on an $n\times n$ grid, repeated $c$ times.
+The same custom two-dimensional array class as described above is used to represent
+the grid.
+\item {\bf SparseMatMult}$\left(n, z, c\right)$: Matrix multiplication between an $n\times n$ sparse matrix,
+stored in compressed-row format, and a full storage vector, stored in a normal array. The matrix has $z$ non-zero elements and the calculation is repeated $c$ times.
+\item {\bf MonteCarlo}$\left(n\right)$: Monte Carlo integration by generating $n$ points uniformly distributed over the unit square and computing the ratio of those within the unit circle.
+\item {\bf LU}$\left(n, c\right)$: Computes the LU factorization of an $n \times n$ matrix. The rows of the matrix are shuffled, which makes the previously used two-dimensional array class unsuitable. Instead a list of arrays is used to represent the matrix. The calculation is repeated $c$ times.
+\item {\bf FFT}$\left(n, c\right)$: Fast Fourier Transform of a vector with $n$ elements, represented as an array, repeated $c$ times.
+\end{itemize}
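Of the SciMark kernels listed above, MonteCarlo is the simplest to sketch. The following is an illustrative Python version based only on the description in the text (function name, seeding, and structure are assumptions, not the ported sources):

```python
import random

# Sketch of the MonteCarlo kernel described above: draw n points
# uniformly over the unit square and return the fraction that falls
# inside the unit circle, which approximates pi/4. A fixed seed is
# used here only to make the sketch deterministic.
def monte_carlo(n, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside / n  # approximates pi/4
```

The loop body is exactly the short, numerical, call-free shape that the paper identifies as the ideal case for loop peeling.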
+
Benchmarks were run on an Intel i7 M620 @2.67GHz with 4M cache and 8G of RAM
using Ubuntu Linux 11.4 in 32-bit mode.
The machine was otherwise unoccupied. We use the following software