[pypy-svn] r63725 - pypy/extradoc/talk/icooolps2009

cfbolz at codespeak.net cfbolz at codespeak.net
Mon Apr 6 17:24:18 CEST 2009


Author: cfbolz
Date: Mon Apr  6 17:24:15 2009
New Revision: 63725

Modified:
   pypy/extradoc/talk/icooolps2009/paper.tex
Log:
A round of language review.


Modified: pypy/extradoc/talk/icooolps2009/paper.tex
==============================================================================
--- pypy/extradoc/talk/icooolps2009/paper.tex	(original)
+++ pypy/extradoc/talk/icooolps2009/paper.tex	Mon Apr  6 17:24:15 2009
@@ -74,7 +74,7 @@
 We present techniques for improving the results when a tracing JIT compiler is
 applied to an interpreter. An unmodified tracing JIT does not perform as well as one
 would hope when the compiled program is itself a bytecode interpreter. We
-examine why that is the case, and how matters can be improved by adding hints to
+examine why that is the case, and how matters can be improved by adding markers to
 the interpreter that help the tracing JIT to improve the results. We evaluate
 the techniques by using them both on a small example as well as on a full Python
 interpreter. This work has been done in the context of the PyPy project.
@@ -86,7 +86,7 @@
 
 Dynamic languages have seen a steady rise in popularity in recent years.
 JavaScript is increasingly being used to implement full-scale applications
-running in browser, whereas other dynamic languages (such as Ruby, Perl, Python,
+which run within a browser, whereas other dynamic languages (such as Ruby, Perl, Python,
 PHP) are used for the server side of many web sites, as well as in areas
 unrelated to the web.
 
@@ -121,18 +121,18 @@
 
 In this paper we discuss ongoing work in the PyPy project to improve the
 performance of interpreters written with the help of the PyPy toolchain. The
-approach is that of a tracing JIT compiler. Opposed to the tracing JITs for dynamic
-languages that exist so far, PyPy's tracing JIT operates "one level down",
-e.g. traces the execution of the interpreter, as opposed to the execution
+approach is that of a tracing JIT compiler. Unlike the tracing JITs for dynamic
+languages that currently exist, PyPy's tracing JIT operates ``one level down'',
+i.e. it traces the execution of the interpreter, as opposed to the execution
 of the user program. The fact that the program the tracing JIT compiles is
 in our case always an interpreter brings its own set of problems. We describe
 tracing JITs and their application to interpreters in Section
-\ref{sect:tracing}.  By this approach we hope to get a JIT compiler that can be
-applied to a variety of dynamic languages, given an interpreter for them. The
+\ref{sect:tracing}.  By this approach we hope to arrive at a JIT compiler that can be
+applied to a variety of dynamic languages, given an appropriate interpreter for each of them. The
 process is not completely automatic but needs a small number of hints from the
 interpreter author, to help the tracing JIT. The details of how the process
 integrates into the rest of PyPy will be explained in Section
-\ref{sect:implementation}. This work is not finished, but already produces some
+\ref{sect:implementation}. This work is not finished, but has already produced some
 promising results, which we will discuss in Section \ref{sect:evaluation}.
 
 The contributions of this paper are:
@@ -205,16 +205,6 @@
 \section{Tracing JIT Compilers}
 \label{sect:tracing}
 
-\arigo{We should not start from scratch and insert as little details what
-differs in our approach when compared to others; instead we should give a
-higher-level overview and then focus on these details, and a couple of
-references for more info about the "common" part.
-%
-In general there are many things that are never said at all.
-I think the introduction should really be written from the point of view of
-someone that has read already some papers for JavaScript.}
-
-
 Tracing JITs are an idea initially explored by the Dynamo project
 \cite{bala_dynamo:transparent_2000} in the context of dynamic optimization of
 machine code at runtime. The techniques were then successfully applied to Java
@@ -250,10 +240,10 @@
 \emph{tracing}.
 
 At first, when the program starts, everything is interpreted.
-The interpreter does a bit of lightweight profiling to figure out which loops
-are run often. This lightweight profiling is usually done by having a counter on
+The interpreter does a small amount of lightweight profiling to establish which loops
+are run most frequently. This lightweight profiling is usually done by having a counter on
 each backward jump instruction that counts how often this particular backward jump
-was executed. Since loops need a backward jump somewhere, this method finds
+was executed. Since loops need a backward jump somewhere, this method looks for
 loops in the user program.
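The backward-jump counting described above can be sketched in a few lines (a hypothetical illustration, not PyPy's actual code; the class name and the threshold value are invented):

```python
HOT_THRESHOLD = 1000  # invented value; real systems tune this


class HotLoopProfiler:
    """Counts how often each backward-jump target is reached."""

    def __init__(self):
        self.counters = {}  # jump target -> number of times jumped to

    def on_jump(self, current_pc, target_pc):
        # Only backward jumps can close a loop, so only they are counted.
        if target_pc < current_pc:
            count = self.counters.get(target_pc, 0) + 1
            self.counters[target_pc] = count
            # Once the counter crosses the threshold, the loop starting
            # at target_pc is considered hot and tracing should begin.
            return count >= HOT_THRESHOLD
        return False
```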
 
 When a hot loop is identified, the interpreter enters a
@@ -262,22 +252,22 @@
 
 Such a history is called a \emph{trace}: it is a sequential list of
 operations, together with their actual operands and results.  By examining the
-trace, it is possible to produce highly efficient machine code by emitting
+trace, it is possible to produce highly efficient machine code by generating
 only the operations needed.  Being sequential, the trace represents only one
 of the many possible paths through the code. To ensure correctness, the trace
 contains a \emph{guard} at every possible point where the path could have
 followed another direction, for example conditions or indirect/virtual
-calls.  When emitting the machine code, every guard is turned into a quick check
+calls.  When generating the machine code, every guard is turned into a quick check
 to guarantee that the path we are executing is still valid.  If a guard fails,
-we immediately quit from the machine code and continue the execution by falling
+we immediately quit the machine code and continue the execution by falling
 back to interpretation.
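To illustrate the role of guards, here is a sketch of how a compiled trace could be executed (hypothetical interpreter-style code; a real backend emits machine code instead, and the small operation set here is invented):

```python
class GuardFailed(Exception):
    """Raised when a guard check fails; the caller then falls back to
    the ordinary interpreter at this point."""


def execute_trace(trace, env):
    # trace: list of (operation, arguments) pairs recorded earlier;
    # env: mapping of variable names to their current values
    for op, args in trace:
        if op == "int_add":
            result, a, b = args
            env[result] = env[a] + env[b]
        elif op == "guard_true":
            (cond,) = args
            if not env[cond]:
                # The path recorded during tracing is no longer valid.
                raise GuardFailed
        # ... further operations elided
    return env
```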
 
 During tracing, the trace is repeatedly
-checked whether the interpreter is at a position in the program that it had seen
-earlier in the trace. If this happens, the trace recorded corresponds to a loop
+checked to determine whether the interpreter is at a position in the program where it had been
+earlier. If this happens, the recorded trace corresponds to a loop
 in the interpreted program. At this point, this loop
 is turned into machine code by taking the trace and making machine code versions
-of all the operations in it. The machine code can then be immediately executed,
+of all the operations in it. The machine code can then be executed immediately,
 starting from the next iteration of the loop, as the machine code represents
 exactly the loop that was being interpreted so far.
 
@@ -286,7 +276,7 @@
 it is possible that later another path through the loop is taken, in which case
 one of the guards that were put into the machine code will fail. There are more
 complex mechanisms in place to still produce more code for the cases of guard
-failures \cite{XXX}, but they are orthogonal to the issues discussed in this
+failures \cite{XXX}, but they are independent of the issues discussed in this
 paper.
 
 It is important to understand how the tracer recognizes that the trace it
@@ -326,7 +316,7 @@
 the intermediate representation of PyPy's translation toolchain after type
 inference has been performed and Python-specifics have been made explicit. At
 first those functions will be interpreted, but after a while, profiling shows
-that the \texttt{while} loop in \texttt{strange\_sum} is executed often.  The
+that the \texttt{while} loop in \texttt{strange\_sum} is executed often.  The
 tracing JIT will then start to trace the execution of that loop.  The trace would
 look as follows:
 {\small
@@ -343,7 +333,7 @@
 \end{verbatim}
 }
 \vspace{-0.4cm}
-The operations in this sequence are operations of the mentioned intermediate
+The operations in this sequence are operations of the above-mentioned intermediate
 representation (e.g. note that the generic modulo and equality operations in the
 function above have been recognized to always take integers as arguments and are thus
 rendered as \texttt{int\_mod} and \texttt{int\_eq}). The trace contains all the
@@ -352,7 +342,7 @@
 failure. The call to \texttt{f} was inlined into the trace. Note that the trace
 contains only the hot \texttt{else} case of the \texttt{if} test in \texttt{f},
 while the other branch is implemented via a guard failure. This trace can then
-be turned into machine code and executed.
+be converted into machine code and executed.
 
 
 %- general introduction to tracing
@@ -365,11 +355,11 @@
 
 The tracing JIT of the PyPy project is atypical in that it is not applied to the
 user program, but to the interpreter running the user program. In this section
-we will explore what problems this brings, and how to solve them (at least
-partially). This means that there are two interpreters involved, and we need
-terminology to distinguish them. On the one hand, there is the interpreter that
+we will explore what problems this brings, and suggest how to solve them (at least
+partially). This means that there are two interpreters involved, and we need appropriate
+terminology to distinguish between them. On the one hand, there is the interpreter that
 the tracing JIT uses to perform tracing. This we will call the \emph{tracing
-interpreter}. On the other hand, there is the interpreter that is running the
+interpreter}. On the other hand, there is the interpreter that runs the
 user's programs, which we will call the \emph{language interpreter}. In the
 following, we will assume that the language interpreter is bytecode-based. The
 program that the language interpreter executes we will call the \emph{user
@@ -380,11 +370,6 @@
 \emph{interpreter loops} are loops \emph{inside} the language interpreter. On
 the other hand, \emph{user loops} are loops in the user program.
 
-\fijal{I find following paragraph out of scope and completely confusing, we
-should instead simply state that we unroll the loop, how we do that and
-why we do that. Completely ignore aspect of an interpreter loop I suppose,
-because everything previously keeps talking about can\_enter\_jit that closes
-loop being available at jump back bytecodes}
 A tracing JIT compiler finds the hot loops of the program it is compiling. In
 our case, this program is the language interpreter. The most important hot loop
 of the language interpreter is its bytecode dispatch loop (for many simple
@@ -427,7 +412,6 @@
 \label{fig:square}
 \end{figure}
 
-\fijal{This paragraph should go away as well}
 Let's look at an example. Figure \ref{fig:tlr-basic} shows the code of a very
 simple bytecode interpreter with 256 registers and an accumulator. The
 \texttt{bytecode} argument is a string of bytes, all register and the
@@ -472,17 +456,17 @@
 value. This happens only at backward jumps in the language interpreter. That
 means that the tracing interpreter needs to check for a closed loop only when it
 encounters a backward jump in the language interpreter. Again the tracing JIT
-cannot known which part of the language interpreter implements backward jumps,
-so it needs to be told with the help of a hint by the author of the language
-interpreter.
+cannot know which part of the language interpreter implements backward jumps,
+so the author of the language interpreter needs to indicate this with the help
+of a hint.
 
 The condition for reusing already existing machine code needs to be adapted to
 this new situation. In a classical tracing JIT there is at most one piece of
 assembler code per loop of the jitted program, which in our case is the language
 interpreter. When applying the tracing JIT to the language interpreter as
 described so far, \emph{all} pieces of assembler code correspond to the bytecode
-dispatch loop of the language interpreter. They correspond to different
-unrollings and paths through that loop though. To figure out which of them to use
+dispatch loop of the language interpreter. However, they correspond to different
+paths through the loop and different ways to unroll it. To ascertain which of them to use
 when trying to enter assembler code again, the program counter of the language
 interpreter needs to be checked. If it corresponds to the position key of one of
 the pieces of assembler code, then this assembler code can be entered. This
@@ -512,7 +496,7 @@
 string is currently being interpreted. All other variables are red.
 
 In addition to the classification of the variables, there are two methods of
-\texttt{JitDriver} that need to be called. Both of them get as arguments the
+\texttt{JitDriver} that need to be called. Both of them receive as arguments the
 current values of the variables listed in the definition of the driver. The
 first one is \texttt{jit\_merge\_point} which needs to be put at the beginning
 of the body of the bytecode dispatch loop. The other, more interesting one, is
@@ -526,7 +510,7 @@
 the "green" variables are the same as at an earlier call to the
 \texttt{can\_enter\_jit} method.
 
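The two hints can be pictured on a small bytecode interpreter in the spirit of the example in the paper (the \texttt{JitDriver} below is a no-op stand-in so that the sketch runs as plain Python; in PyPy it would come from \texttt{pypy.rlib.jit}, and the opcode set here is invented for illustration):

```python
class JitDriver(object):
    """No-op stand-in for PyPy's JitDriver; the translation toolchain
    gives the real calls their meaning."""
    def __init__(self, greens, reds):
        self.greens, self.reds = greens, reds
    def jit_merge_point(self, **live_vars):   # start of the dispatch loop body
        pass
    def can_enter_jit(self, **live_vars):     # called at backward jumps only
        pass

jitdriver = JitDriver(greens=["pc", "bytecode"],  # the position key
                      reds=["a", "regs"])         # all other variables

# Invented opcode set for a 256-register accumulator machine:
JUMP_IF_A, MOV_A_R, MOV_R_A, ADD_R_TO_A, DECR_A, RETURN_A = range(6)

def interpret(bytecode, a):
    regs = [0] * 256
    pc = 0
    while True:
        jitdriver.jit_merge_point(pc=pc, bytecode=bytecode, a=a, regs=regs)
        opcode = ord(bytecode[pc])
        pc += 1
        if opcode == JUMP_IF_A:
            target = ord(bytecode[pc])
            pc += 1
            if a:
                if target < pc:
                    # Closing a loop in the user program: tell the JIT.
                    jitdriver.can_enter_jit(pc=target, bytecode=bytecode,
                                            a=a, regs=regs)
                pc = target
        elif opcode == MOV_A_R:
            regs[ord(bytecode[pc])] = a
            pc += 1
        elif opcode == MOV_R_A:
            a = regs[ord(bytecode[pc])]
            pc += 1
        elif opcode == ADD_R_TO_A:
            a += regs[ord(bytecode[pc])]
            pc += 1
        elif opcode == DECR_A:
            a -= 1
        elif opcode == RETURN_A:
            return a
```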
-For the small example the hints look like a lot of work. However, the amount of
+For the small example the hints look like a lot of work. However, the number of
 hints is essentially constant no matter how large the interpreter is, which
 makes it seem less significant for larger interpreters.
 
@@ -553,7 +537,7 @@
 actually doing any computation that is part of the square function. Instead,
 they manipulate the data structures of the language interpreter. While this is
 to be expected, given that the tracing interpreter looks at the execution of the
-language interpreter, it would still be nicer if some of these operations could
+language interpreter, it would still be an improvement if some of these operations could
 be removed.
 
 The simple insight how to greatly improve the situation is that most of the
@@ -570,13 +554,13 @@
 number, so they can be folded away as well.
 
 With this optimization enabled, the trace looks as in Figure
-\ref{fig:trace-full}. Now a lot of the language interpreter is actually gone
+\ref{fig:trace-full}. Now much of the language interpreter is actually gone
 from the trace and what is left corresponds very closely to the loop of the
 square function. The only vestige of the language interpreter is the fact that
 the register list is still used to store the state of the computation. This
 could be removed by some other optimization, but is maybe not really all that
 bad anyway (in fact we have an experimental optimization that does exactly that,
-but it is not finished).  Once we get this optimized trace, we can pass it to
+but it is not yet finished).  Once we get this optimized trace, we can pass it to
 the \emph{JIT backend}, which generates the corresponding machine code.
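The folding step can be sketched as follows (hypothetical code; the trace representation and the operation table are invented for illustration): any operation whose arguments are all green is evaluated once, its result becomes green too, and the operation disappears from the trace.

```python
# Table of operations the folder knows how to evaluate (invented names).
EVALUATE = {
    "int_add": lambda a, b: a + b,
    "int_eq": lambda a, b: a == b,
}

def fold_greens(trace, green_values):
    """trace: list of (result, op, args) triples;
    green_values: mapping of green variable names to their constants.
    Returns the residual trace with green-only operations folded away."""
    residual = []
    for result, op, args in trace:
        vals = [green_values.get(arg) for arg in args]
        if all(v is not None for v in vals):
            # All arguments are green: evaluate at compile time; the
            # result is a green constant as well.
            green_values[result] = EVALUATE[op](*vals)
        else:
            residual.append((result, op, args))
    return residual
```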
 
 \begin{figure}
@@ -616,7 +600,7 @@
 whether the JIT should be built in or not. If the JIT is not enabled, all the
 hints that are possibly in the interpreter source are just ignored by the
 translation process. In this way, the result of the translation is identical to
-as if no hints were present in the interpreter at all.
+that obtained when no hints were present in the interpreter at all.
 
 If the JIT is enabled, things are more interesting. At the moment the JIT can
 only be enabled when translating the interpreter to C, but we hope to lift that
@@ -630,7 +614,7 @@
 to be practical.
 
 What is done instead is that the language interpreter keeps running as a C
-program, until a hot loop in the user program is found. To identify loops the
+program, until a hot loop in the user program is found. To identify loops, the
 C version of the language interpreter is generated in such a way that at the
 place that corresponds to the \texttt{can\_enter\_jit} hint profiling is
 performed using the program counter of the language interpreter. Apart from this
@@ -669,8 +653,8 @@
 
 \subsection{Various Issues}
 
-This section will hint at some other implementation issues and optimizations
-that we have done that are beyond the scope of this paper (and will be subject
+This section will look at some other implementation issues and optimizations
+that we have done that are beyond the scope of this paper (and will be the subject
 of a later publication).
 
 \textbf{Assembler Backends:} The tracing interpreter uses a well-defined
@@ -690,7 +674,7 @@
 \textbf{Allocation Removal:} A key optimization for making the approach
 produce good code for more complex dynamic language is to perform escape
 analysis on the loop operation after tracing has been performed. In this way all
-objects that are allocated during the loop and don't actually escape the loop do
+objects that are allocated during the loop and do not actually escape the loop do
 not need to be allocated on the heap at all but can be exploded into their
 respective fields.  This is very helpful for dynamic languages where primitive
 types are often boxed, as the constant allocation of intermediate results is
@@ -711,7 +695,7 @@
 In this section we try to evaluate the work done so far by looking at some
 benchmark numbers. Since the work is not finished, these benchmarks can only be
 preliminary. Benchmarking was done on an otherwise idle machine with a 1.4
-GHz Pentium M processor and 1GiB RAM, using Linux 2.6.27. All benchmarks where
+GHz Pentium M processor and 1 GB RAM, using Linux 2.6.27. All benchmarks were
 run 50 times, each in a newly started process. The first run was ignored. The
 final numbers were reached by computing the average of all other runs, the
 confidence intervals were computed using a 95\% confidence level. All times
@@ -720,7 +704,7 @@
 
 The first round of benchmarks (Figure \ref{fig:bench1}) are timings of the
 example interpreter (Figure \ref{fig:tlr-basic}) used in this paper computing
-the square of 46340 (the smallest number so that the square still fits into a 32
+the square of 46340 (the smallest number whose square still fits into a 32
 bit word) using the bytecode of Figure \ref{fig:square}. The results for various
 constellations are as follows:
 
@@ -740,7 +724,7 @@
 previous case.
 
 \textbf{Benchmark 4:} Same as before, but with constant folding enabled. This corresponds to the
-trace in Figure \ref{fig:trace-full}. This speeds up the square function nicely,
+trace in Figure \ref{fig:trace-full}. This speeds up the square function considerably,
 making it about six times faster than the pure interpreter.
 
 \textbf{Benchmark 5:} Same as before, but with the threshold set so high that the tracer is
@@ -816,13 +800,13 @@
 specialisation is Tempo for C \cite{consel_general_1996, consel_uniform_1996}.
 However, it is essentially a normal
 partial evaluator ``packaged as a library''; decisions about what can be
-specialised and how are pre-determined. Another work in this direction is DyC
+specialised, and how, are pre-determined. Another work in this direction is DyC
 \cite{grant_dyc:expressive_2000}, another runtime specializer for C. Both of these projects
-have a similar problem as DynamoRIO.  Targeting the C language makes
-higher-level specialisation difficult (e.g.\ \texttt{malloc} can not be
+have a problem similar to that of DynamoRIO.  Targeting the C language makes
+higher-level specialisation difficult (e.g.\ \texttt{malloc} cannot be
 optimized).
 
-There has been some attempts to do \emph{dynamic partial evaluation}, which is
+There have been some attempts to do \emph{dynamic partial evaluation}, which is
 partial evaluation that defers partial evaluation completely to runtime
 to make partial evaluation more useful for dynamic languages. This concept was
 introduced by Sullivan \cite{sullivan_dynamic_2001} who implemented it for a
@@ -836,17 +820,17 @@
 
 \section{Conclusion and Next Steps}
 
-We have shown techniques for making it practical to apply a tracing
+We have shown techniques for improving the results when applying a tracing
 JIT to an interpreter. Our first benchmarks indicate that these techniques work
 really well on small interpreters and first experiments with PyPy's Python
-interpreter make it seems likely that they can be scaled up to realistic
+interpreter make it appear likely that they can be scaled up to realistic
 examples.
 
 Of course there is a lot of work still left to do. Various optimizations are not
 quite finished. Both tracing and leaving machine code are very slow due to a
 double interpretation overhead and we might need techniques for improving those.
 Furthermore we need to apply the JIT to the various interpreters that are
-written with PyPy to evaluate how widely applicable the described techniques
+written in RPython to evaluate how widely applicable the described techniques
 are. Possible targets for such an evaluation would be the SPy-VM, a Smalltalk
 implementation \cite{bolz_back_2008}, a Prolog interpreter or PyGirl, a Gameboy
 emulator \cite{XXX}; but also less immediately obvious ones, like Python's
