[pypy-commit] extradoc extradoc: describe the scimark benchmarks
hakanardo
noreply at buildbot.pypy.org
Wed Aug 15 19:44:05 CEST 2012
Author: Hakan Ardo <hakan at debian.org>
Branch: extradoc
Changeset: r4585:4887f7fc2e99
Date: 2012-08-15 19:43 +0200
http://bitbucket.org/pypy/extradoc/changeset/4887f7fc2e99/
Log: describe the scimark benchmarks
diff --git a/talk/dls2012/licm.pdf b/talk/dls2012/licm.pdf
index d0e3ca21bc58e605bbf333d46f6acdc18de2a29d..d44c6adbc9741258517f595c681611569b3e9240
GIT binary patch
[cut]
diff --git a/talk/dls2012/paper.tex b/talk/dls2012/paper.tex
--- a/talk/dls2012/paper.tex
+++ b/talk/dls2012/paper.tex
@@ -63,7 +63,7 @@
\newboolean{showcomments}
-\setboolean{showcomments}{true}
+\setboolean{showcomments}{false}
\ifthenelse{\boolean{showcomments}}
{\newcommand{\nb}[2]{
\fbox{\bfseries\sffamily\scriptsize#1}
@@ -931,8 +931,9 @@
we see improvements in several cases. The ideal loop for this optimization
is short and contains numerical calculations with no failing guards and no
external calls. Larger loops involving many operations on complex objects
-typically benefit less from it. Loop peeling never makes runtime performance worse, in
-the worst case the peeled loop is exactly the same as the preamble. Therefore we
+typically benefit less from it. Loop peeling never makes the generated code worse; in
+the worst case the peeled loop is exactly the same as the preamble.
+Therefore we
chose to present benchmarks of small numeric kernels where loop peeling can show
its use.
@@ -983,30 +984,30 @@
\subsection{Python}
The Python interpreter of the RPython framework is a complete Python
version 2.7 compatible interpreter. A set of numerical
-calculations were implemented in both Python and in C and their
+calculations were implemented in Python, C and Lua, and their
runtimes are compared in Figure~\ref{fig:benchmarks}. The benchmarks are
\begin{itemize}
-\item {\bf sqrt}: approximates the square root of $y$. The approximation is
+\item {\bf sqrt}$\left(T\right)$: approximates the square root of $y$. The approximation is
initiated to $x_0=y/2$ and the benchmark consists of a single loop updating this
approximation using $x_i = \left( x_{i-1} + y/x_{i-1} \right) / 2$ for $1\leq i < 10^8$.
Only the latest calculated value $x_i$ is kept alive as a local variable within the loop.
There are three different versions of this benchmark where $x_i$
- is represented with different type of objects: int's, float's and
+ is represented with different types of objects, $T$: ints, floats and
Fix16s. The latter, Fix16, is a custom class that implements
fixed-point arithmetic with 16 bits of precision. In Python there is only
a single implementation of the benchmark that gets specialized
depending on the class of its input argument, $y$, while in C,
there are three different implementations.
-\item {\bf conv3}: one-dimensional convolution with fixed kernel-size $3$. A single loop
+\item {\bf conv3}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $3$. A single loop
is used to calculate a vector ${\bf b} = \left(b_1, \cdots, b_n\right)$ from a vector
${\bf a} = \left(a_1, \cdots, a_n\right)$ and a kernel ${\bf k} = \left(k_1, k_2, k_3\right)$ using
$b_i = k_3 a_i + k_2 a_{i+1} + k_1 a_{i+2}$ for $1 \leq i \leq n$. Both the output vector, $\bf b$,
and the input vectors, $\bf a$ and $\bf k$, are allocated prior to running the benchmark. It is executed
with $n=10^5$ and $n=10^6$.
-\item {\bf conv5}: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with
+\item {\bf conv5}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with
${\bf k} = \left(k_1, k_2, k_3, k_4, k_5\right)$. The enumeration of the elements in $\bf k$ is still
hardcoded into the implementation, making the benchmark consist of a single loop too.
-\item {\bf conv3x3}: two-dimensional convolution with kernel of fixed
+\item {\bf conv3x3}$\left(n\right)$: two-dimensional convolution with kernel of fixed
size $3 \times 3$ using a custom class to represent two-dimensional
arrays. It is implemented as two nested loops that iterate over the elements of the
$n\times n$ output matrix ${\bf B} = \left(b_{i,j}\right)$ and calculates each element from the input matrix
@@ -1021,12 +1022,12 @@
\end{equation}
for $1 \leq i \leq n$ and $1 \leq j \leq n$.
The memory for storing the matrices is again allocated outside the benchmark and $n=1000$ was used.
-\item {\bf dilate3x3}: two-dimensional dilation with kernel of fixed
+\item {\bf dilate3x3}$\left(n\right)$: two-dimensional dilation with kernel of fixed
size $3 \times 3$. This is similar to convolution but instead of
summing over the terms in Equation~\ref{eq:convsum}, the maximum over those terms is taken. That places an
external call to a max function within the loop, which prevents some
of the optimizations.
-\item {\bf sobel}: a low-level video processing algorithm used to
+\item {\bf sobel}$\left(n\right)$: a low-level video processing algorithm used to
locate edges in an image. It calculates the gradient magnitude
using Sobel derivatives. A Sobel x-derivative, $D_x$, of an $n \times n$ image, ${I}$, is formed
by convolving ${I}$ with
@@ -1050,11 +1051,31 @@
on top of a custom two-dimensional array class.
It is
a straightforward implementation providing two-dimensional
-indexing with out of bounds checks. For the C implementations it is
+indexing with out-of-bounds checks and
+data stored in row-major order.
+For the C implementations it is
implemented as a C++ class. The other benchmarks are implemented in
plain C. All the benchmarks except sqrt operate on C double-precision floating
point numbers, both in the Python and the C code.
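To make the kind of kernel discussed above concrete, the sqrt benchmark's single Python implementation might be sketched as follows. This is a minimal illustration based only on the description in the text; the function name and the reduced iteration count are assumptions, not the paper's actual sources.

```python
# Sketch of the sqrt benchmark described above: one Python function that
# the JIT specializes on the class of its argument y (int, float, or a
# Fix16 fixed-point instance). Names are illustrative.
def sqrt_approx(y, iterations=10**8):
    x = y / 2  # x_0 = y / 2
    for _ in range(1, iterations):
        x = (x + y / x) / 2  # x_i = (x_{i-1} + y / x_{i-1}) / 2
    return x
```

As in the benchmark description, only the latest approximation is kept alive across iterations; calling the same function with an int, a float, or a Fix16 instance exercises the three specializations that require three separate implementations in C.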
+In addition we ported the
+SciMark\footnote{\texttt{http://math.nist.gov/scimark2/}} benchmarks to Python, and compared
+their runtimes with the already existing Lua and C implementations.
+This port was performed after the release of the PyPy version used to run the benchmarks, which means that
+these benchmarks have not influenced the PyPy implementation.
+SciMark consists of
+
+\begin{itemize}
+\item {\bf SOR}$\left(n, c\right)$: Jacobi successive over-relaxation on an $n\times n$ grid, repeated $c$ times.
+The same custom two-dimensional array class as described above is used to represent
+the grid.
+\item {\bf SparseMatMult}$\left(n, z, c\right)$: Matrix multiplication between an $n\times n$ sparse matrix,
+stored in compressed-row format, and a full storage vector, stored in a normal array. The matrix has $z$ non-zero elements and the calculation is repeated $c$ times.
+\item {\bf MonteCarlo}$\left(n\right)$: Monte Carlo integration by generating $n$ points uniformly distributed over the unit square and computing the ratio of those within the unit circle.
+\item {\bf LU}$\left(n, c\right)$: Computes the LU factorization of an $n \times n$ matrix. The rows of the matrix are shuffled, which makes the previously used two-dimensional array class unsuitable. Instead a list of arrays is used to represent the matrix. The calculation is repeated $c$ times.
+\item {\bf FFT}$\left(n, c\right)$: Fast Fourier Transform of a vector with $n$ elements, represented as an array, repeated $c$ times.
+\end{itemize}
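Of the SciMark kernels listed above, MonteCarlo is the simplest to sketch. The following is an illustrative Python version based only on the description in the text (function name, seeding, and structure are assumptions, not the ported sources):

```python
import random

# Sketch of the MonteCarlo kernel described above: draw n points
# uniformly over the unit square and return the fraction that falls
# inside the unit circle, which approximates pi/4. A fixed seed is
# used here only to make the sketch deterministic.
def monte_carlo(n, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside / n  # approximates pi/4
```

The loop body is exactly the short, numerical, call-free shape that the paper identifies as the ideal case for loop peeling.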
+
Benchmarks were run on an Intel i7 M620 @2.67GHz with 4M cache and 8G of RAM
using Ubuntu Linux 11.4 in 32-bit mode.
The machine was otherwise unoccupied. We use the following software