[pypy-svn] r43848 - pypy/extradoc/talk/dls2007

arigo at codespeak.net
Tue May 29 15:31:42 CEST 2007


Author: arigo
Date: Tue May 29 15:31:41 2007
New Revision: 43848

Added:
   pypy/extradoc/talk/dls2007/paper.bib
      - copied, changed from r43833, pypy/extradoc/talk/dyla2007/dyla.bib
Modified:
   pypy/extradoc/talk/dls2007/Makefile
   pypy/extradoc/talk/dls2007/paper.tex
Log:
Finish LaTeXification, complete things a bit.


Modified: pypy/extradoc/talk/dls2007/Makefile
==============================================================================
--- pypy/extradoc/talk/dls2007/Makefile	(original)
+++ pypy/extradoc/talk/dls2007/Makefile	Tue May 29 15:31:41 2007
@@ -1,7 +1,7 @@
 
-paper.pdf: paper.tex #paper.bib image/*.pdf
-	#pdflatex paper
-	#bibtex paper
+paper.pdf: paper.tex paper.bib
+	pdflatex paper
+	bibtex paper
 	pdflatex paper
 	pdflatex paper
 

Modified: pypy/extradoc/talk/dls2007/paper.tex
==============================================================================
--- pypy/extradoc/talk/dls2007/paper.tex	(original)
+++ pypy/extradoc/talk/dls2007/paper.tex	Tue May 29 15:31:41 2007
@@ -52,21 +52,36 @@
 experimented with \cite{REJIT}, but this is clearly an area in need of
 research and innovative approaches.
 
-One of the central goals of the PyPy project is to automatically
+One of the central goals of the PyPy project \cite{PyPy} is to automatically
 produce dynamic compilers from an interpreter, with as few
 modifications to the interpreter code base itself as possible.
 
-The forest of flow graphs that the translation process \cite{VMCDLS}
-generates and transforms constitutes a reasonable base for the
-necessary analyses.  That's a further reason why having a high-level
-runnable and analyzable interpreter implementation was always a
-central tenet of the project: in our approach,
-the dynamic compiler is just another aspect
-transparently introduced by and during the translation
-process.
+PyPy contains a complete interpreter for the Python language, written in
+a high-level language, RPython, which is a subset of Python amenable to
+static analysis.  It also contains a translation toolchain for compiling
+this interpreter to either C (or C-like) environments, or to the
+higher-level environments provided by general-purpose virtual machines
+like Java's and .NET.  The translation toolchain accepts any RPython
+program as input, although our focus was on translating RPython programs
+that are interpreters for dynamic languages.\footnote{We also have an
+interpreter for Prolog and the beginning of one for JavaScript.}
+
+The translation framework uses control flow graphs in SSI format as its
+intermediate representation (SSI is a stricter subset of SSA).  The
+details of this process are beyond the scope of the present paper, and
+have been presented in \cite{pypyvmconstruction}.
+The present paper describes a
+special optional transformation that we integrated with this translation
+framework: deriving a dynamic compiler from the interpreter.  In other
+words, our translation framework can take as input an interpreter for
+any language (it works best for dynamic languages); as long as the
+interpreter is written in RPython and contains a small number of extra
+hints, the framework can produce from it a complete virtual machine
+\emph{that contains a just-in-time compiler for the dynamic language.}
 
 Partial evaluation techniques should, at least theoretically,
-allow such a derivation of a compiler from an interpreter [PE], but it
+allow such a derivation of a compiler from an interpreter
+\cite{partial-evaluation}, but it
 is not reasonable to expect the code produced for an input program by
 a compiler derived using partial evaluation to be very good,
 especially in the case of a dynamic language.  Essentially, the input
@@ -83,9 +98,9 @@
 This will allow the compiler to generate code optimized for the
 actual run-time behaviour of the program.
 
-Inspired by Psyco \cite{PSYCO}, which is a hand-written dynamic compiler
+Inspired by Psyco \cite{psyco-paper}, which is a hand-written dynamic compiler
 based on partial evaluation for Python, we developed a technique --
-*promotion* - for our dynamic compiler generator. Simply put, promotion
+\emph{promotion} -- for our dynamic compiler generator. Simply put, promoting
 a value stops compilation and waits until execution reaches this
 point.  When it does, the actual run-time value is promoted into a
 compile-time constant, and compilation resumes with this extra
@@ -102,98 +117,113 @@
 
 \subsection{Overview of partial evaluation}
 
-Partial evaluation is the process of evaluating a function, say ``f(x,
-y)``, with only partial information about the values of its arguments,
-say the value of the ``x`` argument only.  This produces a *residual*
-function ``g(y)``, which takes less arguments than the original -- only
+\def\code#1{\texttt{#1}}
+
+Partial evaluation is the process of evaluating a function, say \code{f(x,
+y)}, with only partial information about the values of its arguments,
+say the value of the \code{x} argument only.  This produces a \emph{residual}
+function \code{g(y)}, which takes fewer arguments than the original -- only
 the information not specified during the partial evaluation process needs
-to be provided to the residual function, in this example the ``y``
+to be provided to the residual function, in this example the \code{y}
 argument.
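+
+For illustration, the idea can be written out by hand in Python (a
+minimal sketch; the concrete function bodies here are ours):
+%
+\begin{verbatim}
+  def f(x, y):
+      return x * x + y
+
+  # partially evaluating f with x = 5
+  # by hand gives the residual function
+  def g(y):
+      return 25 + y    # x*x folded
+\end{verbatim}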
 
 Partial evaluation (PE) comes in two flavors:
+%
+\begin{enumerate}
 
-* *On-line* PE: a compiler-like algorithm takes the source code of the
-  function ``f(x, y)`` (or its intermediate representation, i.e. its
+\item\emph{On-line PE:} a compiler-like algorithm takes the source code of the
+  function \code{f(x, y)} (or its intermediate representation, i.e.\ its
   control flow graph in PyPy's terminology), and some partial
-  information, e.g. ``x=5``.  From this, it produces the residual
-  function ``g(y)`` directly, by following in which operations the
-  knowledge ``x=5`` can be used, which loops can be unrolled, etc.
+  information, e.g.\ \code{x=5}.  From this, it produces the residual
+  function \code{g(y)} directly, by following in which operations the
+  knowledge \code{x=5} can be used, which loops can be unrolled, etc.
 
-* *Off-line* PE: in many cases, the goal of partial evaluation is to
+\item\emph{Off-line PE:} in many cases, the goal of partial evaluation is to
   improve performance in a specific application.  Assume that we have a
-  single known function ``f(x, y)`` in which we think that the value of
-  ``x`` will change slowly during the execution of our program -- much
-  more slowly than the value of ``y``.  An obvious example is a loop
-  that calls ``f(x, y)`` many times with always the same value ``x``.
-  We could then use an on-line partial evaluator to produce a ``g(y)``
-  for each new value of ``x``.  In practice, the overhead of the partial
+  single known function \code{f(x, y)} in which we think that the value of
+  \code{x} will change slowly during the execution of our program -- much
+  more slowly than the value of \code{y}.  An obvious example is a loop
+  that calls \code{f(x, y)} many times, always with the same value \code{x}.
+  We could then use an on-line partial evaluator to produce a \code{g(y)}
+  for each new value of \code{x}.  In practice, the overhead of the partial
   evaluator might be too large for it to be executed at run-time.
-  However, if we know the function ``f`` in advance, and if we know
-  *which* arguments are the ones that we will want to partially evaluate
-  ``f`` with, then we do not need a full compiler-like analysis of ``f``
-  every time the value of ``x`` changes.  We can precompute once and for
-  all a specialized function ``f1(x)``, which when called produces the
-  residual function ``g(y)`` corresponding to ``x``.  This is *off-line
-  partial evaluation;* the specialized function ``f1(x)`` is called a
-  *generating extension*.
+  However, if we know the function \code{f} in advance, and if we know
+  \emph{which} arguments are the ones that we will want to partially evaluate
+  \code{f} with, then we do not need a full compiler-like analysis of \code{f}
+  every time the value of \code{x} changes.  We can precompute once and for
+  all a specialized function \code{f1(x)}, which when called produces the
+  residual function \code{g(y)} corresponding to \code{x}.  This is
+  \emph{off-line partial evaluation;} the specialized function \code{f1(x)}
+  is called a \emph{generating extension} (sketched below).
+
+\end{enumerate}
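+
+A hand-written sketch of a generating extension for the small example
+above, as a Python closure (PyPy produces such functions automatically;
+this version only illustrates the interface):
+%
+\begin{verbatim}
+  def f1(x):
+      # compile-time work, done once
+      # per value of x
+      xx = x * x
+      def g(y):       # residual function
+          return xx + y
+      return g
+\end{verbatim}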
 
 The PyPy JIT generation framework is based on off-line partial
-evaluation.  The function called ``f(x, y)`` above is typically the main
+evaluation.  The function called \code{f(x, y)} above is typically the main
 loop of some interpreter written in RPython.  The size of the
 interpreter can range
 from a three-liner used for testing purposes to the whole of PyPy's
-Python interpreter.  In all cases, ``x`` stands for the input program
-(the bytecode to interpret) and ``y`` stands for the input data (like a
+Python interpreter.  In all cases, \code{x} stands for the input program
+(the bytecode to interpret) and \code{y} stands for the input data (like a
 frame object with the binding of the input arguments and local
 variables).  Our framework is capable of automatically producing the
-corresponding generating extension ``f1(x)``, which takes an input
-program only and produces a residual function ``g(y)``.  This ``f1(x)``
+corresponding generating extension \code{f1(x)}, which takes an input
+program only and produces a residual function \code{g(y)}.  This \code{f1(x)}
 is a compiler\footnote{
     What we get in PyPy is more precisely a \emph{just-in-time compiler:}
     if promotion is used, compiling ahead of time is not possible.
 }
-for the very same language for which ``f(x, y)`` is
+for the very same language for which \code{f(x, y)} is
 an interpreter.
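+
+As a toy instance of this setting, consider the following hand-written
+interpreter fragment (invented for illustration, in the spirit of the
+three-liners we use for testing):
+%
+\begin{verbatim}
+  ADD1, DOUBLE = 0, 1
+
+  def interp(bytecode, acc):
+      # bytecode plays the role of x,
+      # acc plays the role of y
+      for op in bytecode:
+          if op == ADD1:
+              acc = acc + 1
+          elif op == DOUBLE:
+              acc = acc * 2
+      return acc
+\end{verbatim}
+
+A generating extension derived from \code{interp} can unroll the loop
+over \code{bytecode} and leave only the arithmetic on \code{acc} as
+residual code.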
 
-Off-line partial evaluation is based on *binding-time analysis,* which
+Off-line partial evaluation is based on \emph{binding-time analysis,} which
 is the process of determining among the variables used in a function (or
 a set of functions) which ones are going to be known in advance and
-which ones are not.  In the example of ``f(x, y)``, such an analysis
-would be able to infer that the constantness of the argument ``x``
+which ones are not.  In the example of \code{f(x, y)}, such an analysis
+would be able to infer that the constantness of the argument \code{x}
 implies the constantness of many intermediate values used in the
-function.  The *binding time* of a variable determines how early the
+function.  The \emph{binding time} of a variable determines how early the
 value of the variable will be known.
 
 Once binding times have been determined, one possible approach to
 producing the generating extension itself is by self-applying on-line
 partial evaluators.  This is known as the second Futamura projection
-\cite{FU}.  So far it is unclear if this approach can lead to optimal
+\cite{Futamura}.  So far it is unclear if this approach can lead to optimal
 results, or even if it scales well.  In PyPy we selected a more direct
 approach: the generating extension is produced by transformation of the
 control flow graphs of the interpreter, guided by the binding times.  We
-call this process *timeshifting*.
+call this process \emph{timeshifting.}
+
+
+\subsection{Related work}
+
+XXX PE; Psyco; REJIT; ?
 
 
 \section{Architecture and Principles}
 
 PyPy contains a framework for generating just-in-time compilers using
 off-line partial evaluation.  As such, there are three distinct phases:
+%
+\begin{enumerate}
 
-* *Translation time:* during the normal translation of an RPython
+\item\emph{Translation time:} during the normal translation of an RPython
   program, say PyPy's Python interpreter, we perform binding-time
   analysis and off-line specialization ("timeshifting") of the
   interpreter.  This produces a generating extension, which is linked
   with the rest of the program.
 
-* *Compile time:* during the execution of the program, when a new
+\item\emph{Compile time:} during the execution of the program, when a new
   bytecode is about to be interpreted, the generating extension is
   invoked instead.  As the generating extension is a compiler, all the
   computations it performs are called compile-time computations.  Its
   sole effect is to produce residual code.
 
-* *Run time:* the normal execution of the program (which includes the
+\item\emph{Run time:} the normal execution of the program (which includes the
   time spent running the residual code created by the generating
   extension).
 
+\end{enumerate}
+
 Translation time is a purely off-line phase; compile time and run time
 are actually highly interleaved during the execution of the program.
 
@@ -202,33 +232,37 @@
 \label{bta}
 
 At translation time, PyPy performs binding-time analysis of the source
-RPython program after it has been turned to low-level graphs, i.e. at
+RPython program after it has been turned into low-level graphs, i.e.\ at
 the level at which operations manipulate pointer-and-structure-like
 objects.
 
 The binding-time terminology that we are using in PyPy is based on the
 colors that we use when displaying the control flow graphs:
-
-* *Green* variables contain values that are known at compile-time;
-* *Red* variables contain values that are not known until run-time.
+%
+\begin{itemize}
+\item\emph{Green} variables contain values that are known at compile-time;
+\item\emph{Red} variables contain values that are not known until run-time.
+\end{itemize}
 
 The binding-time analyzer of our translation tool-chain is based on the
 same type inference engine that is used on the source RPython program,
-the annotator.  In this mode, it is called the *hint-annotator*; it
+the annotator.  In this mode, it is called the \emph{hint-annotator;} it
 operates over input graphs that are already low-level instead of
 RPython-level, and propagates annotations that do not track types but
 value dependencies and manually-provided binding time hints.
 
 The normal process of the hint-annotator is to propagate the binding
-time (i.e. color) of the variables using the following kind of rules:
+time (i.e.\ color) of the variables using the following kind of rules
+(a toy model in code follows the list):
+%
+\begin{itemize}
 
-* For a foldable operation (i.e. one without side effect and which
+\item For a foldable operation (i.e.\ one without side effects and which
   depends only on its argument values), if all arguments are green,
   then the result can be green too.
 
-* Non-foldable operations always produce a red result.
+\item Non-foldable operations always produce a red result.
 
-* At join points, where multiple possible values (depending on control
+\item At join points, where multiple possible values (depending on control
   flow) are meeting into a fresh variable, if any incoming value comes
   from a red variable, the result is red.  Otherwise, the color of the
   result might be green.  We do not make it eagerly green, because of
@@ -238,6 +272,8 @@
   fresh join variable thus depends on which branches are taken in the
   residual graph.
 
+\end{itemize}
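+
+As a toy model of the first two rules (ours, much simplified from the
+actual hint-annotator, which also tracks value dependencies):
+%
+\begin{verbatim}
+  def result_color(foldable, arg_colors):
+      # a foldable operation with only
+      # green arguments may be green
+      if foldable and 'red' not in arg_colors:
+          return 'green'
+      return 'red'
+\end{verbatim}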
+
 \subsubsection*{Hints}
 
 Our goal in designing our approach to binding-time analysis was to
@@ -248,24 +284,25 @@
 The driving idea was that hints should be need-oriented.  Indeed, in a
 program like an interpreter, there are a small number of places where it
 would be clearly beneficial for a given value to be known at
-compile-time, i.e. green: this is where we require the hints to be
+compile-time, i.e.\ green: this is where we require the hints to be
 added.
 
 The hint-annotator assumes that all variables are red by default, and
 then propagates annotations that record dependency information.
 When encountering the user-provided hints, the dependency information
 is used to make some variables green.  All
-hints are in the form of an operation ``hint(v1, someflag=True)``
+hints are in the form of an operation \code{hint(v1, someflag=True)}
 which semantically just returns its first argument unmodified.
 
-The crucial need-oriented hint is ``v2 = hint(v1, concrete=True)``
+The crucial need-oriented hint is
+$$\code{v2 = hint(v1, concrete=True)}$$
 which should be used in places where the programmer considers the
 knowledge of the value to be essential.  This hint is interpreted by
-the hint-annotator as a request for both ``v1`` and ``v2`` to be green.  It
-has a *global* effect on the binding times: it means that not only
-``v1`` but all the values that ``v1`` depends on -- recursively --
+the hint-annotator as a request for both \code{v1} and \code{v2} to be green.  It
+has a \emph{global} effect on the binding times: it means that not only
+\code{v1} but all the values that \code{v1} depends on -- recursively --
 are forced to be green.  The hint-annotator complains if the
-dependencies of ``v1`` include a value that cannot be green, like
+dependencies of \code{v1} include a value that cannot be green, like
 a value read out of a field of a non-immutable structure.
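+
+When the interpreter runs normally, without the JIT machinery,
+\code{hint} is semantically trivial; a sketch:
+%
+\begin{verbatim}
+  def hint(v, **flags):
+      # identity function; the keyword
+      # flags are directives for the
+      # hint-annotator only
+      return v
+\end{verbatim}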
 
 Such a need-oriented backward propagation has advantages over the
@@ -276,22 +313,23 @@
 of the residual code), or fewer variables than expected (preventing
 specialization from occurring where it would be most useful).  Our
 need-oriented approach reduces the problem of over-specialization, and
-it prevents under-specialization: an unsatisfiable ``hint(v1,
-concrete=True)`` is reported as an error.
+it prevents under-specialization: an unsatisfiable \code{hint(v1,
+concrete=True)} is reported as an error.
 
 In our context, though, such an error can be corrected.  This is done by
-promoting a well-chosen variable among the ones that ``v1`` depends on.
+promoting a well-chosen variable among the ones that \code{v1} depends on.
 
 Promotion is invoked with the use of a hint as well:
-``v2 = hint(v1, promote=True)``.
-This hint is a *local* request for ``v2`` to be green, without
-requiring ``v1`` to be green.  Note that this amounts to copying
+\code{v2 = hint(v1, promote=True)}.
+This hint is a \emph{local} request for \code{v2} to be green, without
+requiring \code{v1} to be green.  Note that this amounts to copying
 a red value into a green one, which is not possible in classical
 approaches to partial evaluation.  See section \ref{promotion} for a
 complete discussion of promotion.
 
 For examples and further discussion on how the hints are applied in practice
-see `Make your own JIT compiler` \cite{D08.1}.
+see \emph{Make your own JIT compiler} at
+\code{http://codespeak.net/pypy/dist/pypy/doc/jit.html}. % XXX check url
 
 \subsection{Timeshifting}
 
@@ -307,7 +345,7 @@
     cannot be expressed as low-level flow graphs).
 }
 accordingly in order to produce a generating extension.  We call
-this process *timeshifting* because it changes the time at
+this process \emph{timeshifting} because it changes the time at
 which the graphs are meant to be run, from run-time to compile-time.
 
 Despite the execution time and side-effects shift to produce only
@@ -330,32 +368,40 @@
 The basic idea of timeshifting is to transform operations in a way that
 depends on the color of their operands and result.  Variables themselves
 need to be represented based on their color:
+%
+\begin{itemize}
 
-* The red (run-time) variables have abstract values at compile-time;
+\item The red (run-time) variables have abstract values at compile-time;
   no actual value is available for them during compile-time. For them
   we use a boxed representation that can carry either a run-time storage
   location (a stack frame position or a register name) or an immediate
   constant (for when the value is, after all, known at compile-time).
 
-* On the other hand, the green variables are the ones that can carry
+\item On the other hand, the green variables are the ones that can carry
   their value already at compile-time, so they are left untouched during
   timeshifting.
 
+\end{itemize}
+
 The operations of the original graphs are then transformed as follows:
+%
+\begin{itemize}
 
-* If an operation has no side effect nor any other run-time dependency, and
+\item If an operation has no side effects nor any other run-time dependency, and
   if it only involves green operands, then it can stay unmodified in the
   graph.  In this case, the operation that was run-time in the original
   graph becomes a compile-time operation, and it will never be generated
   in the residual code.  (This is the case that makes the whole approach
   worthwhile: some operations become purely compile-time.)
 
-* In all other cases, the operation might have to be generated in the
+\item In all other cases, the operation might have to be generated in the
   residual code.  In the timeshifted graph it is replaced by a call
   to a helper which will generate a residual operation manipulating
   the input run-time values and return a new boxed representation
   for the run-time result location.
 
+\end{itemize}
+
 These helpers will constant-fold the operation if the inputs
 are immediate constants and if the operation has no side-effects.
 Immediate constants can occur even though the
 corresponding variable in the graph was red: a variable can be
@@ -363,7 +409,7 @@
 point in (compile)-time, independently of the hint-annotator
 proving that it is always the case.
 In Partial Evaluation terminology, the timeshifted graphs are
-performing some *on-line* partial evaluation in addition to the
+performing some \emph{on-line} partial evaluation in addition to the
 off-line job enabled by the hint-annotator.
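+
+A sketch of such a helper for an integer addition (the tuple-based box
+representation and the \code{codegen} interface here are ours, for
+illustration only):
+%
+\begin{verbatim}
+  # a box is ('const', value) or
+  # ('run', location)
+  def ts_int_add(codegen, b1, b2):
+      if b1[0] == 'const' and \
+         b2[0] == 'const':
+          # fold at compile-time
+          return ('const', b1[1] + b2[1])
+      # otherwise emit a residual add and
+      # box its run-time result location
+      loc = codegen.emit('int_add', b1, b2)
+      return ('run', loc)
+\end{verbatim}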
 
 \subsubsection*{Merges and Splits}
@@ -373,16 +419,16 @@
 This state is used to shape the control flow of the generated residual
 code, as follows.
 
-After a *split,* i.e. after a conditional branch that could not be
+After a \emph{split,} i.e.\ after a conditional branch that could not be
 folded at compile-time, the compilation state is duplicated and both
-branches are compiled independently.  Conversely, after a *merge point,*
-i.e. when two control flow paths meet each other, we try to join the two
+branches are compiled independently.  Conversely, after a \emph{merge point,}
+i.e.\ when two control flow paths meet each other, we try to join the two
 paths in the residual code.  This part is more difficult because the two
-paths may need to be compiled with different variable bindings -- e.g.
-different variables may be known to take different compile-time constant
+paths may need to be compiled with different variable bindings --
+e.g.\ different variables may be known to take different compile-time constant
 values in the two branches.  The two paths can either be kept separate
 or merged; in the latter case, the merged compilation-time state needs
-to be a generalization (*widening*) of the two already-seen states.
+to be a generalization \emph{(widening)} of the two already-seen states.
 Deciding when to do each is a classical problem of partial evaluation,
 as merging too eagerly may lose important precision and not merging
 eagerly enough may create too many redundant residual code paths (to the
@@ -414,7 +460,7 @@
 \label{promotion}
 
 In the sequel, we describe in more detail one of the main new
-techniques introduced in our approach, which we call *promotion*.  In
+techniques introduced in our approach, which we call \emph{promotion.}  In
 short, it allows an arbitrary run-time value to be turned into a
 compile-time value at any point in time.  Each promotion point is
 explicitly defined with a hint that must be put in the source code of
@@ -425,7 +471,7 @@
 copying a variable whose binding time is compile-time into a variable
 whose binding time is run-time -- it corresponds to the compiler
 "forgetting" a particular value that it knew about.  By contrast,
-promotion is a way for the compiler to gain *more* information about
+promotion is a way for the compiler to gain \emph{more} information about
 the run-time execution of a program. Clearly, this requires
 fine-grained feedback from run-time to compile-time, thus a
 dynamic setting.
@@ -457,7 +503,8 @@
 techniques are crucial for good results.  The main goal is to
 optimize and reduce the overhead of dynamic dispatching and indirect
 invocation.  This is achieved with variations on the technique of
-polymorphic inline caches \cite{PIC}: the dynamic lookups are cached and
+polymorphic inline caches \cite{polymorphic-inline-caches}:
+the dynamic lookups are cached and
 the corresponding generated machine code contains chains of
 compare-and-jump instructions which are modified at run-time.  These
 techniques also allow the gathering of information to direct inlining for even
@@ -472,7 +519,7 @@
 promoted to compile-time.  As we will see in the sequel, this produces
 very similar machine code.\footnote{
     This can also be seen as a generalization of a partial
-    evaluation transformation called "The Trick" (see e.g. \cite{PE}),
+    evaluation transformation called "The Trick" (see e.g.\ \cite{partial-evaluation}),
     which again produces similar code but which is only
     applicable for finite sets of values.
 }
@@ -486,7 +533,7 @@
 \subsubsection*{Promotion in practice}
 
 The implementation of promotion requires a tight coupling between
-compile-time and run-time: a *callback,* put in the generated code,
+compile-time and run-time: a \emph{callback,} put in the generated code,
 which can invoke the compiler again.  When the callback is actually
 reached at run-time, and only then, the compiler resumes and uses the
 knowledge of the actual run-time value to generate more code.
@@ -499,85 +546,86 @@
 While this describes the general idea, the details are open to slight
 variations.  Let us show more precisely the way the JIT compilers
 produced by PyPy 1.0 work.  Our first example is purely artificial:
-
+%
 \begin{verbatim}
-        ...
-        b = a / 10
-        c = hint(b, promote=True)
-        d = c + 5
-        print d
-        ...
+    ...
+    b = a / 10
+    c = hint(b, promote=True)
+    d = c + 5
+    print d
+    ...
 \end{verbatim}
 
-In this example, ``a`` and ``b`` are run-time variables and ``c`` and
-``d`` are compile-time variables; ``b`` is copied into ``c`` via a
+In this example, \code{a} and \code{b} are run-time variables and \code{c} and
+\code{d} are compile-time variables; \code{b} is copied into \code{c} via a
 promotion.  The division is a run-time operation while the addition is a
 compile-time operation.
 
 The compiler derived from an interpreter containing the above code
 generates the following machine code (in pseudo-assembler notation),
-assuming that ``a`` comes from register ``r1``:
-
+assuming that \code{a} comes from register \code{r1}:
+%
 \begin{verbatim}
-     ...
-        r2 = div r1, 10
-     Label1:
-        jump Label2
-        <some reserved space here>
-
-     Label2:
-        call continue_compilation(r2, <state data pointer>)
-        jump Label1
+ ...
+    r2 = div r1, 10
+ Label1:
+    jump Label2
+    <some reserved space here>
+
+ Label2:
+    call continue_compilation(r2, <state data ptr>)
+    jump Label1
 \end{verbatim}
 
-The first time this machine code runs, the ``continue\_compilation()``
-function resumes the compiler.  The two arguments to the function are
-the actual run-time value from the register ``r2``, which the compiler
+The first time this machine code runs, the function called
+\code{continue\_compilation()}
+resumes the compiler.  The two arguments to the function are
+the actual run-time value from the register \code{r2}, which the compiler
 will now consider as a compile-time constant, and an immediate pointer
 to data that was generated along with the above code snippet and which
 contains enough information for the compiler to know where and with
 which state it should resume.
 
-Assuming that the first run-time value taken by ``r1`` is, say, 42, then
-the compiler will see ``r2 == 4`` and update the above machine code as
+Assuming that the first run-time value taken by \code{r1} is, say, 42, then
+the compiler will see \code{r2 == 4} and update the above machine code as
 follows:
-
+%
 \begin{verbatim}
-     ...
-        r2 = div r1, 10
-     Label1:
-        compare r2, 4            # patched
-        jump-if-equal Label3     # patched
-        jump Label2              # patched
-        <less reserved space left>
-
-     Label2:
-        call continue_compilation(r2, <state data pointer>)
-        jump Label1
-
-     Label3:                     # new code
-        call print(9)            # new code
-        ...
+ ...
+    r2 = div r1, 10
+ Label1:
+    compare r2, 4            # patched
+    jump-if-equal Label3     # patched
+    jump Label2              # patched
+    <less reserved space left>
+
+ Label2:
+    call continue_compilation(r2, <state data ptr>)
+    jump Label1
+
+ Label3:                     # new code
+    call print(9)            # new code
+    ...
 \end{verbatim}
 
 Notice how the addition is constant-folded by the compiler.  (Of course,
 in real examples, different promoted values typically make the compiler
 constant-fold complex code path choices in different ways, and not just
-simple operations.)  Note also how the code following ``Label1`` is an
+simple operations.)  Note also how the code following \code{Label1} is an
 updatable switch which plays the role of a polymorphic inline cache.
 The "polymorphic" terminology does not apply in our context, though, as
 the switch does not necessarily have to be on the type of an object.
 
-After the update, the original call to ``continue\_compilation()``
+After the update, the original call to \code{continue\_compilation()}
 returns and execution loops back to the now-patched switch at
-``Label1``.  This run and all following runs in which ``r1`` is between
-40 and 49 will thus directly go to ``Label3``.  Obviously, if other
-values show up, ``continue\_compilation()`` will be invoked again, so new
-code will be generated and the code at ``Label1`` further patched to
+\code{Label1}.  This run and all following runs in which \code{r1} is between
+40 and 49 will thus directly go to \code{Label3}.  Obviously, if other
+values show up, \code{continue\_compilation()} will be invoked again, so new
+code will be generated and the code at \code{Label1} further patched to
 check for more cases.
 
 If, over the course of the execution of a program, too many cases are
-seen, the reserved space after ``Label1`` will eventually run out.
+seen, the reserved space after \code{Label1} will eventually run out.
 Currently, we simply reserve more space elsewhere and patch the final
 jump accordingly.  There could be better strategies, which we have
 not implemented so far, such as discarding old code and reusing its slots
@@ -587,13 +635,13 @@
 
 \subsubsection*{Implementation notes}
 
-The *state data pointer* in the example above contains a snapshot of the
+The state data pointer in the example above contains a snapshot of the
 state of the compiler when it reached the promotion point.  Its memory
 impact is potentially large -- a complete continuation for each generated
 switch, which can never be reclaimed because new run-time values may
 always show up later during the execution of the program.
 
-To reduce the problem we compress the state into a so-called *path*.
+To reduce the problem we compress the state into a so-called \emph{path.}
 The full state is only stored at a few specific points.\footnote{
     More precisely, at merge points that the user needs to mark
     as "global".  The control flow join point corresponding to the
@@ -602,15 +650,15 @@
 }
 The compiler
 records a trace of the multiple paths it followed from the last full
-snapshot in a lightweight tree structure.  The *state data pointer* is
+snapshot in a lightweight tree structure.  The state data pointer is
 then only a pointer to a node in the tree; the branch from that node to
+the root describes a path that lets the compiler quickly \emph{replay} its
+the root describes a path that let the compiler quickly \emph{replay} its
 actions (without generating code again) from the latest full snapshot to
 rebuild its internal state and get back to the original promotion point.
 
 For example, if the interpreter source code contains promotions inside a
 run-time condition:
-
+%
 \begin{verbatim}
         if condition:
             ...
@@ -625,7 +673,7 @@
 then the tree will contain three nodes: a root node storing the
 snapshot, a child with a "True case" marker, and another child with a
 "False case" marker.  Each promotion point generates a switch and a call
-to ``continue\_compilation()`` pointing to the appropriate child node.
+to \code{continue\_compilation()} pointing to the appropriate child node.
 The compiler can re-reach the correct promotion point by following the
 markers on the branch from the root to the child.
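+
+The tree itself can be modelled as follows (a simplified sketch; the
+markers stand for entries like the "True case" marker above):
+%
+\begin{verbatim}
+  class PathNode:
+      def __init__(self, parent, marker):
+          self.parent = parent  # None at
+                                # the root
+          self.marker = marker
+
+  def replay_path(node):
+      # markers from the full snapshot
+      # down to the promotion point
+      path = []
+      while node.parent is not None:
+          path.append(node.marker)
+          node = node.parent
+      path.reverse()
+      return path
+\end{verbatim}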
 
@@ -642,7 +690,7 @@
 of fresh variables, one per field.  In the compiler, the variable that
 would normally contain the pointer to the structure gets instead a
 content that is neither a run-time value nor a compile-time constant,
-but a special *virtual structure* -- a compile-time data structure that
+but a special \emph{virtual structure} -- a compile-time data structure that
 recursively contains new variables, each of which can again store a
 run-time, a compile-time, or a virtual structure value.
 
@@ -650,54 +698,54 @@
 around by the compiler really represent run-time locations -- the name of
 a CPU register or a position in the machine stack frame.  This is the
 case for both regular variables and the fields of virtual structures.
-It means that the compilation of a ``getfield`` or ``setfield``
+It means that the compilation of a \code{getfield} or \code{setfield}
 operation performed on a virtual structure simply loads or stores such a
 location reference into the virtual structure; the actual value is not
 copied around at run-time.
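+
+A sketch of the corresponding compile-time object (ours, simplified;
+the field contents are the boxed representations described earlier):
+%
+\begin{verbatim}
+  class VirtualStruct:
+      def __init__(self):
+          self.fields = {}  # name -> box
+      def getfield(self, name):
+          # pure compile-time read: no
+          # residual code is generated
+          return self.fields[name]
+      def setfield(self, name, box):
+          self.fields[name] = box
+\end{verbatim}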
 
 It is not always possible to keep structures virtual.  The main
-situation in which it needs to be "forced" (i.e. actually allocated at
+situation in which it needs to be "forced" (i.e.\ actually allocated at
 run-time) is when the pointer escapes to some non-virtual location like
 a field of a real heap structure.
 
 Virtual structures still avoid the run-time allocation of most
 short-lived objects, even in non-trivial situations.  The following
-example shows a typical case.  Consider the Python expression ``a+b+c``.
-Assume that ``a`` contains an integer.  The PyPy Python interpreter
+example shows a typical case.  Consider the Python expression \code{a+b+c}.
+Assume that \code{a} contains an integer.  The PyPy Python interpreter
 implements application-level integers as boxes -- instances of a
-``W\_IntObject`` class with a single ``intval`` field.  Here is the
+\code{W\_IntObject} class with a single \code{intval} field.  Here is the
 addition of two integers:
-
+%
 \begin{verbatim}
-    def add(w1, w2):            # w1, w2 are W_IntObject instances
-        value1 = w1.intval
-        value2 = w2.intval
-        result = value1 + value2
-        return W_IntObject(result)
+  def add(w1, w2):          # w1, w2 are instances
+      value1 = w1.intval    # of W_IntObject
+      value2 = w2.intval
+      result = value1 + value2
+      return W_IntObject(result)
 \end{verbatim}
 
-When interpreting the bytecode for ``a+b+c``, two calls to ``add()`` are
-issued; the intermediate ``W\_IntObject`` instance is built by the first
+When interpreting the bytecode for \code{a+b+c}, two calls to \code{add()} are
+issued; the intermediate \code{W\_IntObject} instance is built by the first
 call and thrown away after the second call.  By contrast, when the
 interpreter is turned into a compiler, the construction of the
-``W\_IntObject`` object leads to a virtual structure whose ``intval``
+\code{W\_IntObject} object leads to a virtual structure whose \code{intval}
 field directly references the register in which the run-time addition
 put its result.  This location is read out of the virtual structure at
-the beginning of the second ``add()``, and the second run-time addition
+the beginning of the second \code{add()}, and the second run-time addition
 directly operates on the same register.
 
 An interesting effect of virtual structures is that they play nicely with
-promotion.  Indeed, before the interpreter can call the proper ``add()``
+promotion.  Indeed, before the interpreter can call the proper \code{add()}
 function for integers, it must first determine that the two arguments
 are indeed integer objects.  In the corresponding dispatch logic, we
 have added two hints to promote the type of each of the two arguments.
 This produces a compiler that has the following behavior: in the general
-case, the expression ``a+b`` will generate two consecutive run-time
+case, the expression \code{a+b} will generate two consecutive run-time
 switches followed by the residual code of the proper version of
-``add()``.  However, in ``a+b+c``, the virtual structure representing
+\code{add()}.  However, in \code{a+b+c}, the virtual structure representing
 the intermediate value will contain a compile-time constant as type.
 Promoting a compile-time constant is trivial -- no run-time code is
-generated.  The whole expression ``a+b+c`` thus only requires three
+generated.  The whole expression \code{a+b+c} thus only requires three
 switches instead of four.  It is easy to see that even more switches can
 be skipped in larger examples; typically, in a tight loop manipulating
 only integers, all objects are virtual structures for the compiler and
@@ -722,7 +770,7 @@
 or dictionary implementing the bindings of the locals.  Then each local
 variable of the interpreted language can be represented as a separate
 run-time value in the generated code, or be itself further virtualized
-(e.g. as a virtual ``W\_IntObject`` structure as seen above).
+(e.g.\ as a virtual \code{W\_IntObject} structure as seen above).
 
 The issue is that the frame object is sometimes built in advance by
 non-JIT-generated code; even when it is not, it immediately escapes into
@@ -732,12 +780,12 @@
 into a global data structure (even though in practice most frame
 objects are deallocated without ever having been introspected).
 
-To solve this problem, we introduced *virtualizable structures,* a mix
+To solve this problem, we introduced \emph{virtualizable structures,} a mix
 between regular run-time structures and virtual structures.  A
 virtualizable structure is a
 structure that exists at run-time in the heap, but that is
 simultaneously treated as virtual by the compiler.  Accesses to the
 structure from the code generated by the JIT are virtualized away,
-i.e.  don't involve run-time copying.  The trade-off is that in order
+i.e.\ don't involve run-time copying.  The trade-off is that in order
 to keep both views synchronized, accesses to the run-time structure
 from regular code not produced by the JIT need to perform an extra
 check.
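+
+The check can be sketched as follows, for reading a local variable out
+of a frame (the attribute and helper names are invented; the real
+synchronization is more involved):
+%
+\begin{verbatim}
+  def frame_getlocal(frame, n):
+      if frame.jit_state is not None:
+          # the JIT holds the current
+          # (virtual) view; copy it back
+          # into the heap structure
+          frame.force_from_jit()
+      return frame.locals[n]
+\end{verbatim}
+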
@@ -776,92 +824,52 @@
 
 We quickly mention below a few other features and implementation details
 of the JIT generation framework.  More information
-can be found in the on-line documentation.
+can be found in the on-line documentation \cite{PyPy}.  % => ref to web site
+%
+\begin{itemize}
 
-* There are more user-specified hints available, like *deep-freezing,*
+\item There are more user-specified hints available, like \emph{deep-freezing,}
   which marks an object as immutable in order to allow accesses to
   its content to be constant-folded at compile-time.
 
-* The compiler representation of a run-time value for a non-virtual
+\item The compiler representation of a run-time value for a non-virtual
   structure may additionally remember that some fields are actually
   compile-time constants.  This occurs for example when a field is
   read from the structure at run-time and then promoted to compile-time.
 
-* In addition to virtual structures, lists and dictionaries can also be
+\item In addition to virtual structures, lists and dictionaries can also be
   virtual.
 
-* Exception handling is achieved by inserting explicit operations into
+\item Exception handling is achieved by inserting explicit operations into
   the graphs before they are timeshifted.  Most of these run-time
   exception manipulations are then virtualized away, by treating the
   exception state as virtual.
 
-* Timeshifting is performed in two phases: a first step transforms the
+\item Timeshifting is performed in two phases: a first step transforms the
   graphs by updating their control flow and inserting pseudo-operations
   to drive the compiler; a second step (based on the RTyper \cite{D05.1})
   replaces all necessary operations by calls to support code.
 
-* The support code implements the generic behaviour of the compiler,
-  e.g. the merge logic.  It is about 3500 lines of RPython code.  The
+\item The support code implements the generic behaviour of the compiler,
+  e.g.\ the merge logic.  It is about 3500 lines of RPython code.  The
   rest of the hint-annotator and timeshifter is about 3800 lines of
   Python code.
 
-* The machine code backends (two so far, Intel IA32 and PowerPC) are
+\item The machine code backends (two so far, Intel IA32 and PowerPC) are
   about 3500 further lines of RPython code each.  There is a
   well-defined interface between the JIT compiler support code and the
   backends, making writing new backends relatively easy.  The unusual
   part of the interface is the support for the run-time updatable
   switches.
 
-
-\subsection{Open issues}
-
-Here are what we think are the most important points that will need
-attention in order to make the approach more robust:
-
-* The timeshifted graphs currently compile many branches eagerly.  This
-  can easily result in residual code explosion.  Depending on the source
-  interpreter this can also result in non-termination issues, where
-  compilation never completes.  The opposite extreme would be to always
-  compile branches lazily, when they are about to be executed, as Psyco
-  does.  While this neatly sidesteps termination issues, the best
-  solution is probably something in between these extremes.
-
-* As described in the Promotion section (\ref{promotion}),
-  we need fall-back solutions for when the
-  number of promoted run-time values seen at a particular point becomes
-  too large.
-
-* We need more flexible control about what to inline or not to inline in
-  the residual code.
-
-* The widening heuristics for merging needs to be refined.
-
-* The JIT generation framework needs to be made aware of some other
-  translation-time aspects \cite{D05.4} \cite{D07.1} in order to produce the
-  correct residual code (e.g. code calling the correct Garbage
-  Collection routines or supporting Stackless-style stack unwinding).
-
-* We did not work yet on profile-directed identification of program hot
-  spots.  Currently, the interpreter must decide when to invoke the JIT
-  or not (which can itself be based on explicit requests from the interpreted
-  program).
-
-* The machine code backends can be improved.
-
-The latter point opens an interesting future research direction: can we
-layer our kind of JIT compiler on top of a virtual machine that already
-contains a lower-level JIT compiler?  In other words, can we delegate
-the difficult questions of machine code generation to a lower
-independent layer, e.g. inlining, re-optimization of frequently executed
-code, etc.?  What changes would be required to an existing virtual
-machine, e.g. a Java Virtual Machine, to support this?
+\end{itemize}
 
 
 \section{Results}
 
 The following test function is an example of purely arithmetic code
 written in Python, which the PyPy JIT can run extremely fast:
-
+%
 \begin{verbatim}
    def f1(n):
        "Arbitrary test function."
@@ -876,39 +884,40 @@
        return x
 \end{verbatim}
 
-We measured the time required to compute ``f1(2117)`` on the following
+We measured the time required to compute \code{f1(2117)} on the following
 interpreters:
+%
+\begin{itemize}
 
-* Python 2.4.4, the standard CPython implementation.
+\item Python 2.4.4, the standard CPython implementation.
 
-* A version of pypy-c including a generated JIT compiled.
+\item A version of pypy-c (our Python interpreter translated to a stand-alone
+  executable via C) including a generated JIT compiler.
 
-* gcc 4.1.1 compiling the above function rewritten in C (which, unlike
+\item gcc 4.1.1 compiling the above function rewritten in C (which, unlike
   the other two, does not do any overflow checking on the arithmetic
   operations).
 
+\end{itemize}
+
 The relative results have been found to vary by 25\% depending on the
 machine.  On our reference benchmark machine, a 4-core Intel(R)
 Xeon(TM) CPU 3.20GHz with 5GB of RAM, we obtained the following results
 (the numbers in parentheses are the slow-down ratios relative to the
 unoptimized gcc compilation):
 
-+-----------------------------------------+------------------+
-| Interpreter                             | Seconds per call |
-+=========================================+==================+
-| Python 2.4.4                            | 0.82    (132x)   |
-+-----------------------------------------+------------------+
-| Python 2.4.4 with Psyco 1.5.2           | 0.0062  (1.00x)  |
-+-----------------------------------------+------------------+
-| pypy-c with the JIT turned off          | 1.77    (285x)   |
-+-----------------------------------------+------------------+
-| pypy-c with the JIT turned on           | 0.0091  (1.47x)  | 
-+-----------------------------------------+------------------+
-| gcc                                     | 0.0062  (1x)     |
-+-----------------------------------------+------------------+
-| gcc -O2                                 | 0.0022  (0.35x)  |
-+-----------------------------------------+------------------+
-
+\begin{tabular}{|l|ll|}
+\hline
+Interpreter & \multicolumn{2}{|c|}{Seconds per call} \\
+\hline
+Python 2.4.4                            & 0.82   & (132x)   \\
+Python 2.4.4 with Psyco 1.5.2           & 0.0062 & (1.00x)  \\
+pypy-c with the JIT turned off          & 1.77   & (285x)   \\
+pypy-c with the JIT turned on           & 0.0091 & (1.47x)  \\
+gcc                                     & 0.0062 & (1.00x)  \\
+gcc -O2                                 & 0.0022 & (0.35x)  \\
+\hline
+\end{tabular}
 
 This table shows that the PyPy JIT is able to generate residual code
 that runs within the same order of magnitude as a non-optimizing gcc.  It
@@ -932,6 +941,54 @@
 as 1.15x.
 
 
+\section{Future work}
+
+Here are what we think are the most important points that will need
+attention in order to make the approach more robust:
+%
+\begin{itemize}
+
+\item The timeshifted graphs currently compile many branches eagerly.  This
+  can easily result in residual code explosion.  Depending on the source
+  interpreter this can also result in non-termination issues, where
+  compilation never completes.  The opposite extreme would be to always
+  compile branches lazily, when they are about to be executed, as Psyco
+  does.  While this neatly sidesteps termination issues, the best
+  solution is probably something in between these extremes.
+
+\item As described in the Promotion section (\ref{promotion}),
+  we need fall-back solutions for when the
+  number of promoted run-time values seen at a particular point becomes
+  too large.
+
+\item We need more flexible control over what to inline or not to inline in
+  the residual code.
+
+\item The widening heuristics for merging need to be refined.
+
+\item The JIT generation framework needs to be made aware of some other
+  translation-time aspects in order to produce the correct residual code
+  (e.g.\ code calling the correct Garbage Collection routines or
+  supporting Stackless-style stack unwinding \cite{D07.1}).
+
+\item We have not yet worked on profile-directed identification of program hot
+  spots.  Currently, the interpreter must decide when to invoke the JIT
+  or not (which can itself be based on explicit requests from the interpreted
+  program).
+
+\item The machine code backends can be improved.
+
+\end{itemize}
+
+The latter point opens an interesting future research direction: can we
+layer our kind of JIT compiler on top of a virtual machine that already
+contains a lower-level JIT compiler?  In other words, can we delegate
+the difficult questions of machine code generation to a lower
+independent layer, e.g.\ inlining, re-optimization of frequently executed
+code, etc.?  What changes would be required to an existing virtual
+machine, e.g.\ a Java Virtual Machine, to support this?
+
+
 \section{Conclusion}
 
 Producing the results described in the previous section requires the
@@ -943,8 +1000,8 @@
 boxing and to propagate them in the CPU stack and registers.
 
 Some slight reorganisation of the interpreter main loop without semantic
-influence, marking the frames as virtualizable (\ref{virtualizable}),
-and adding hints at
+impact, marking the frames as virtualizable
+(section \ref{virtualizable}), and adding hints at
 a few crucial points was all that was necessary for our Python
 interpreter.
 
@@ -957,9 +1014,24 @@
 compiler would be robust against language changes up to the need to
 maintain and possibly change the hints.
 
-We consider this as a major breakthrough in term of the possibilities
-it opens for language design and implementation; it was one of the
-main goals of the research program within the PyPy project.
+We consider this a major breakthrough in terms of the possibilities it
+opens for language design and implementation; it was one of the main
+goals of the research program within the PyPy project.  Only groups with
+very large amounts of resources can afford the high costs of writing
+just-in-time compilers from scratch.  Communities with limited available
+resources for the implementation and maintenance of a language, such
+as academic and open source projects, cannot afford such costs
+-- and even when experimental just-in-time compilers exist, the mere
+fact of having to maintain them in parallel with other implementations
+is taxing for such communities, particularly when the languages in
+question evolve quickly.  In the PyPy approach, from a single simple
+implementation for the language, we can generate stand-alone virtual
+machines whose performance far exceeds that of traditional hand-written
+virtual machines (like CPython, the reference C implementation of
+Python); with the generation of a dynamic compiler, we achieve
+state-of-the-art performance.
+
+% XXX balance columns
 
 
 %.. References (title not necessary, latex generates it)
@@ -975,7 +1047,7 @@
 %.. [D08.1] `Release a JIT Compiler for PyPy Including Processor Backends
 %           for Intel and PowerPC`, PyPy EU-Report, 2007
 %
-%.. [FU]    `Partial evaluation of compuation process -- an approach to a
+%.. [FU]    `Partial evaluation of computation process -- an approach to a
 %           compiler-compiler`, Yoshihito Futamura, Higher-Order and
 %           Symbolic Computation, 12(4):363-397, 1999.  Reprinted from
 %           Systems Computers Controls 2(5), 1971
@@ -1003,8 +1075,7 @@
 %           conference on Object-oriented programming languages, systems, and
 %           applications, pp. 944-953, ACM Press, 2006
 
-\bigskip
-
+% ---- Bibliography ----
 \bibliographystyle{abbrv}
 \bibliography{paper}
 


