[pypy-svn] r43834 - pypy/extradoc/talk/dls2007
arigo at codespeak.net
Tue May 29 09:36:03 CEST 2007
Author: arigo
Date: Tue May 29 09:36:02 2007
New Revision: 43834
Added:
pypy/extradoc/talk/dls2007/
pypy/extradoc/talk/dls2007/acm_proc_article-sp.cls
- copied unchanged from r43833, pypy/extradoc/talk/dls2006/acm_proc_article-sp.cls
pypy/extradoc/talk/dls2007/paper.tex
Log:
Partial LaTeX-ification of D08.2. That's 9 pages for now.
Added: pypy/extradoc/talk/dls2007/paper.tex
==============================================================================
--- (empty file)
+++ pypy/extradoc/talk/dls2007/paper.tex Tue May 29 09:36:02 2007
@@ -0,0 +1,1015 @@
+\documentclass{acm_proc_article-sp}
+
+\begin{document}
+
+\title{Generating Just-In-Time Specializing Compilers}
+
+\numberofauthors{2}
+\author{
+\alignauthor Armin Rigo\\
+ \affaddr{Heinrich-Heine-Universität Düsseldorf}\\
+ \affaddr{Institut für Informatik}\\
+ \affaddr{Universitätsstra{\ss}e 1}\\
+ \affaddr{D-40225 Düsseldorf}\\
+ \affaddr{Deutschland}\\
+ \email{arigo at tunes.org}
+\alignauthor Samuele Pedroni\\
+ \affaddr{AB Strakt}\\
+ \affaddr{Norra Ågatan 10A}\\
+ \affaddr{416 64 Göteborg}\\
+ \affaddr{Sweden}\\
+ \email{pedronis at strakt.com}
+}
+\date{31 May 2007}
+\maketitle
+
+%\category{D.3.4}{Programming Languages}{Processors}[code generation,
+%interpreters, run-time environments]
+%\category{F.3.2}{Logics and Meanings of Programs}{Semantics of Programming
+%Languages}[program analysis]
+
+\begin{abstract}
+PyPy's translation tool-chain -- from the interpreter written in RPython
+to generated VMs for low-level platforms -- is now able to extend those
+VMs with an automatically generated dynamic compiler, derived from the
+interpreter. This is achieved by a pragmatic application of partial
+evaluation techniques guided by a few hints added to the source of the
+interpreter. Crucial for the effectiveness of dynamic compilation is
+the use of run-time information to improve compilation results: to
+this effect, our approach introduces a novel and powerful primitive
+called "promotion", which promotes run-time values to compile-time.
+In this paper, we describe promotion along with other novel techniques
+that allow the approach to scale to something as large as PyPy's
+Python interpreter.
+\end{abstract}
+
+\section{Introduction}
+
+Dynamic compilers are costly to write and hard to maintain, but highly
+desirable for competitive performance. Straightforward bytecode
+interpreters, on the other hand, are much easier to write. Hybrid
+approaches have been experimented with \cite{REJIT}, but this is
+clearly an area in need of research and innovative approaches.
+
+One of the central goals of the PyPy project is to automatically
+produce dynamic compilers from an interpreter, with as few
+modifications of the interpreter code base itself as possible.
+
+The forest of flow graphs that the translation process \cite{VMCDLS}
+generates and transforms constitutes a reasonable base for the
+necessary analyses. That's a further reason why having a high-level
+runnable and analyzable interpreter implementation was always a
+central tenet of the project: in our approach,
+the dynamic compiler is just another aspect
+transparently introduced by and during the translation
+process.
+
+Partial evaluation techniques should, at least theoretically,
+allow such a derivation of a compiler from an interpreter \cite{PE},
+but it is not reasonable to expect the code produced for an input
+program by a compiler derived using partial evaluation to be very
+good, especially in the case of a dynamic language. Essentially, the
+input program does not contain enough information to generate good
+code: in the case of a dynamic language, for example, it contains
+hardly any type information.
+
+What is really desired is not to produce a compiler doing static
+ahead-of-time compilation, as classical partial evaluation would do,
+but one capable of dynamic compilation, exploiting run-time
+information in its result. Compilation should be able to suspend, let
+the produced code run to collect run-time information (for example
+language-level types), and then resume with this extra information.
+This allows the compiler to generate code optimized for the
+effective run-time behaviour of the program.
+
+Inspired by Psyco \cite{PSYCO}, which is a hand-written dynamic compiler
+based on partial evaluation for Python, we developed a technique --
+*promotion* -- for our dynamic compiler generator. Simply put, promotion
+on a value stops compilation and waits until the run-time reaches this
+point. When it does, the actual run-time value is promoted into a
+compile-time constant, and compilation resumes with this extra
+information.
+
+Promotion is an essential technique for generating truly dynamic
+compilers that can exploit run-time information.
+Besides promotion (section \ref{promotion}),
+the novel techniques introduced by PyPy that allow
+the approach to scale are virtualizable structures
+(section \ref{virtualizable}) and need-oriented
+binding time analysis (section \ref{bta}).
+
+
+\subsection{Overview of partial evaluation}
+
+Partial evaluation is the process of evaluating a function, say ``f(x,
+y)``, with only partial information about the values of its arguments,
+say the value of the ``x`` argument only. This produces a *residual*
+function ``g(y)``, which takes fewer arguments than the original -- only
+the information not specified during the partial evaluation process needs
+to be provided to the residual function, in this example the ``y``
+argument.
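+
+As a toy illustration (a hypothetical example, not taken from PyPy
+itself), partially evaluating the following function with respect to
+``x=5`` yields the residual function ``g(y)`` shown below it:
+
+\begin{verbatim}
+    def f(x, y):
+        return x * y + x
+
+    # residual function for x=5: the argument x has
+    # disappeared, replaced everywhere by the constant 5
+    def g(y):
+        return 5 * y + 5
+\end{verbatim}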
+
+Partial evaluation (PE) comes in two flavors:
+
+* *On-line* PE: a compiler-like algorithm takes the source code of the
+ function ``f(x, y)`` (or its intermediate representation, i.e. its
+ control flow graph in PyPy's terminology), and some partial
+ information, e.g. ``x=5``. From this, it produces the residual
+ function ``g(y)`` directly, by following in which operations the
+ knowledge ``x=5`` can be used, which loops can be unrolled, etc.
+
+* *Off-line* PE: in many cases, the goal of partial evaluation is to
+ improve performance in a specific application. Assume that we have a
+ single known function ``f(x, y)`` in which we think that the value of
+ ``x`` will change slowly during the execution of our program -- much
+ more slowly than the value of ``y``. An obvious example is a loop
+ that calls ``f(x, y)`` many times with always the same value ``x``.
+ We could then use an on-line partial evaluator to produce a ``g(y)``
+ for each new value of ``x``. In practice, the overhead of the partial
+ evaluator might be too large for it to be executed at run-time.
+ However, if we know the function ``f`` in advance, and if we know
+ *which* arguments are the ones that we will want to partially evaluate
+ ``f`` with, then we do not need a full compiler-like analysis of ``f``
+ every time the value of ``x`` changes. We can precompute once and for
+ all a specialized function ``f1(x)``, which when called produces the
+ residual function ``g(y)`` corresponding to ``x``. This is *off-line
+ partial evaluation;* the specialized function ``f1(x)`` is called a
+ *generating extension*.
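+
+Continuing the toy example above, a generating extension for ``f``
+specialized on ``x`` could look as follows. This is only a minimal
+sketch, in which the "compilation" of the residual function is
+simulated by generating and executing Python source code:
+
+\begin{verbatim}
+    def f1(x):
+        # toy generating extension for f(x, y): it "compiles"
+        # a residual function g(y) for the given value of x
+        source = ("def g(y):\n"
+                  "    return %d * y + %d" % (x, x))
+        namespace = {}
+        exec source in namespace
+        return namespace['g']
+
+    g = f1(5)     # compile once for x = 5 ...
+    print g(7)    # ... then run the residual code: prints 40
+\end{verbatim}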
+
+The PyPy JIT generation framework is based on off-line partial
+evaluation. The function called ``f(x, y)`` above is typically the
+main loop of some interpreter written in RPython. The size of the
+interpreter can range
+from a three-liner used for testing purposes to the whole of PyPy's
+Python interpreter. In all cases, ``x`` stands for the input program
+(the bytecode to interpret) and ``y`` stands for the input data (like a
+frame object with the binding of the input arguments and local
+variables). Our framework is capable of automatically producing the
+corresponding generating extension ``f1(x)``, which takes an input
+program only and produces a residual function ``g(y)``. This ``f1(x)``
+is a compiler\footnote{
+ What we get in PyPy is more precisely a \emph{just-in-time compiler:}
+ if promotion is used, compiling ahead of time is not possible.
+}
+for the very same language for which ``f(x, y)`` is
+an interpreter.
+
+Off-line partial evaluation is based on *binding-time analysis,* which
+is the process of determining among the variables used in a function (or
+a set of functions) which ones are going to be known in advance and
+which ones are not. In the example of ``f(x, y)``, such an analysis
+would be able to infer that the constantness of the argument ``x``
+implies the constantness of many intermediate values used in the
+function. The *binding time* of a variable determines how early the
+value of the variable will be known.
+
+Once binding times have been determined, one possible approach to
+producing the generating extension itself is by self-applying on-line
+partial evaluators. This is known as the second Futamura projection
+\cite{FU}. So far it is unclear whether this approach can lead to optimal
+results, or even whether it scales well. In PyPy we selected a more direct
+approach: the generating extension is produced by transformation of the
+control flow graphs of the interpreter, guided by the binding times. We
+call this process *timeshifting*.
+
+
+\section{Architecture and Principles}
+
+PyPy contains a framework for generating just-in-time compilers using
+off-line partial evaluation. As such, there are three distinct phases:
+
+* *Translation time:* during the normal translation of an RPython
+ program, say PyPy's Python interpreter, we perform binding-time
+ analysis and off-line specialization ("timeshifting") of the
+ interpreter. This produces a generating extension, which is linked
+ with the rest of the program.
+
+* *Compile time:* during the execution of the program, when a new
+ bytecode is about to be interpreted, the generating extension is
+ invoked instead. As the generating extension is a compiler, all the
+ computations it performs are called compile-time computations. Its
+ sole effect is to produce residual code.
+
+* *Run time:* the normal execution of the program (which includes the
+ time spent running the residual code created by the generating
+ extension).
+
+Translation time is a purely off-line phase; compile time and run time
+are actually highly interleaved during the execution of the program.
+
+
+\subsection{Binding Time Analysis}
+\label{bta}
+
+At translation time, PyPy performs binding-time analysis of the source
+RPython program after it has been turned to low-level graphs, i.e. at
+the level at which operations manipulate pointer-and-structure-like
+objects.
+
+The binding-time terminology that we are using in PyPy is based on the
+colors that we use when displaying the control flow graphs:
+
+* *Green* variables contain values that are known at compile-time;
+* *Red* variables contain values that are not known until run-time.
+
+The binding-time analyzer of our translation tool-chain is based on the
+same type inference engine that is used on the source RPython program,
+the annotator. In this mode, it is called the *hint-annotator*; it
+operates over input graphs that are already low-level instead of
+RPython-level, and propagates annotations that do not track types but
+value dependencies and manually-provided binding time hints.
+
+The normal process of the hint-annotator is to propagate the binding
+time (i.e. color) of the variables using the following kind of rules:
+
+* For a foldable operation (i.e. one without side effects, whose result
+  depends only on its argument values), if all arguments are green,
+  then the result can be green too.
+
+* Non-foldable operations always produce a red result.
+
+* At join points, where multiple possible values (depending on control
+ flow) are meeting into a fresh variable, if any incoming value comes
+ from a red variable, the result is red. Otherwise, the color of the
+ result might be green. We do not make it eagerly green, because of
+ the control flow dependency: the residual function is basically a
+ constant-folded copy of the source function, so it might retain some
+ of the same control flow. The value that needs to be stored in the
+ fresh join variable thus depends on which branches are taken in the
+ residual graph.
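+
+As a rough sketch -- hypothetical pseudo-code, not the
+hint-annotator's actual implementation -- the first two rules could be
+expressed as follows:
+
+\begin{verbatim}
+    def color_of_result(op, color):
+        # 'color' maps each variable to "green" or "red";
+        # 'op.is_foldable()' and 'op.args' are illustrative names
+        if op.is_foldable() and \
+           all([color[v] == "green" for v in op.args]):
+            return "green"   # computable at compile-time
+        return "red"         # must appear in the residual code
+\end{verbatim}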
+
+\subsubsection*{Hints}
+
+Our goal in designing our approach to binding-time analysis was to
+minimize the number of explicit hints that the user must provide in
+the source of the RPython program. This minimalism was not pushed to
+extremes, though, to keep the hint-annotator reasonably simple.
+
+The driving idea was that hints should be need-oriented. Indeed, in a
+program like an interpreter, there are a small number of places where it
+would be clearly beneficial for a given value to be known at
+compile-time, i.e. green: this is where we require the hints to be
+added.
+
+The hint-annotator assumes that all variables are red by default, and
+then propagates annotations that record dependency information.
+When encountering the user-provided hints, the dependency information
+is used to make some variables green. All
+hints are in the form of an operation ``hint(v1, someflag=True)``
+which semantically just returns its first argument unmodified.
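+
+At run-time, in other words, ``hint`` behaves like the identity
+function on its first argument; only the hint-annotator gives a
+meaning to the keyword flags. A minimal sketch:
+
+\begin{verbatim}
+    def hint(v, **flags):
+        # the flags (concrete=True, promote=True, ...) are only
+        # interpreted at translation time by the hint-annotator;
+        # when executed normally, hint() just returns v
+        return v
+\end{verbatim}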
+
+The crucial need-oriented hint is ``v2 = hint(v1, concrete=True)``
+which should be used in places where the programmer considers the
+knowledge of the value to be essential. This hint is interpreted by
+the hint-annotator as a request for both ``v1`` and ``v2`` to be green. It
+has a *global* effect on the binding times: it means that not only
+``v1`` but all the values that ``v1`` depends on -- recursively --
+are forced to be green. The hint-annotator complains if the
+dependencies of ``v1`` include a value that cannot be green, like
+a value read out of a field of a non-immutable structure.
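+
+For instance, in a bytecode interpreter the opcode being dispatched
+on must be known at compile-time. A hypothetical snippet of such an
+interpreter main loop could request this as follows (a sketch only;
+PyPy's real main loop is more involved):
+
+\begin{verbatim}
+    while pc < len(bytecode):
+        opcode = bytecode[pc]
+        # the dispatch below needs the opcode at compile-time;
+        # this forces 'bytecode' and 'pc' to be green as well
+        opcode = hint(opcode, concrete=True)
+        if opcode == PUSH:
+            ...
+\end{verbatim}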
+
+Such a need-oriented backward propagation has advantages over the
+commonly used forward propagation, in which a variable is compile-time
+if and only if all the variables it depends on are also compile-time. A
+known issue with forward propagation is that it may mark as compile-time
+either more variables than expected (which leads to over-specialization
+of the residual code), or fewer variables than expected (preventing
+specialization from occurring where it would be most useful). Our
+need-oriented approach reduces the problem of over-specialization, and
+it prevents under-specialization: an unsatisfiable ``hint(v1,
+concrete=True)`` is reported as an error.
+
+In our context, though, such an error can be corrected. This is done by
+promoting a well-chosen variable among the ones that ``v1`` depends on.
+
+Promotion is invoked with the use of a hint as well:
+``v2 = hint(v1, promote=True)``.
+This hint is a *local* request for ``v2`` to be green, without
+requiring ``v1`` to be green. Note that this amounts to copying
+a red value into a green one, which is not possible in classical
+approaches to partial evaluation. See section \ref{promotion} for a
+complete discussion of promotion.
+
+For examples and further discussion of how the hints are applied in
+practice, see `Make your own JIT compiler` \cite{D08.1}.
+
+\subsection{Timeshifting}
+
+Once binding times (colors) have been assigned to all variables in a
+family of control flow graphs, the next step is to mutate the graphs\footnote{
+ One should keep in mind that the program described as the "source RPython
+ program" in this document is typically an interpreter -- the canonical
+ example is that it is the whole PyPy Standard Interpreter. This
+ program is meant to execute at run-time, and directly compute the
+ intended result and side-effects. The translation process transforms
+ it into a forest of flow graphs. These are the flow graphs that
+ timeshifting processes (and not the application-level program, which typically
+ cannot be expressed as low-level flow graphs).
+}
+accordingly in order to produce a generating extension. We call
+this process *timeshifting* because it changes the time at
+which the graphs are meant to be run, from run-time to compile-time.
+
+Although their execution time and side-effects are shifted to producing
+residual code only, the timeshifted graphs have a shape (flow of control)
+that is closely related to that of the original graphs. This is because
+at compile-time the timeshifted graphs go over all the operations that
+the original graphs would have performed at run-time, following the same
+control flow; some of these operations and control flow constructs are
+constant-folded at compile-time, and the rest is turned into equivalent
+residual code. Another point of view is that as the timeshifted graphs
+form a generating extension, they perform the equivalent of an abstract
+interpretation of the original graphs over a domain containing
+compile-time values and run-time value locations.
+
+The rest of this section describes this timeshifting process in more
+detail.
+
+\subsubsection*{Red and Green Operations}
+
+The basic idea of timeshifting is to transform operations in a way that
+depends on the color of their operands and result. Variables themselves
+need to be represented based on their color:
+
+* The red (run-time) variables have abstract values at compile-time;
+ no actual value is available for them during compile-time. For them
+ we use a boxed representation that can carry either a run-time storage
+ location (a stack frame position or a register name) or an immediate
+ constant (for when the value is, after all, known at compile-time).
+
+* On the other hand, the green variables are the ones that can carry
+ their value already at compile-time, so they are left untouched during
+ timeshifting.
+
+The operations of the original graphs are then transformed as follows:
+
+* If an operation has no side effects or other run-time dependencies, and
+ if it only involves green operands, then it can stay unmodified in the
+ graph. In this case, the operation that was run-time in the original
+ graph becomes a compile-time operation, and it will never be generated
+ in the residual code. (This is the case that makes the whole approach
+ worthwhile: some operations become purely compile-time.)
+
+* In all other cases, the operation might have to be generated in the
+ residual code. In the timeshifted graph it is replaced by a call
+ to a helper which will generate a residual operation manipulating
+ the input run-time values and return a new boxed representation
+ for the run-time result location.
+
+These helpers will constant-fold the operation if the inputs are
+immediate constants and if the operation has no side-effects.
+Immediate constants can occur even though the corresponding variable
+in the graph was red: a variable can be dynamically found to contain a
+compile-time constant at a particular point in (compile-)time, even
+when the hint-annotator cannot prove that this is always the case.
+In Partial Evaluation terminology, the timeshifted graphs are
+performing some *on-line* partial evaluation in addition to the
+off-line job enabled by the hint-annotator.
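+
+A sketch of such a helper for an integer addition might look as
+follows (hypothetical code; the names and interfaces are illustrative
+only):
+
+\begin{verbatim}
+    def timeshifted_int_add(jitstate, box1, box2):
+        if box1.is_constant() and box2.is_constant():
+            # on-line constant-folding: no residual code
+            res = box1.getvalue() + box2.getvalue()
+            return ConstantBox(res)
+        # emit a residual 'int_add' operation and return a box
+        # standing for the run-time location of its result
+        return jitstate.emit_residual('int_add', box1, box2)
+\end{verbatim}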
+
+\subsubsection*{Merges and Splits}
+
+The timeshifted code carries around an object that stores the
+compilation-time state -- mostly the current bindings of the variables.
+This state is used to shape the control flow of the generated residual
+code, as follows.
+
+After a *split,* i.e. after a conditional branch that could not be
+folded at compile-time, the compilation state is duplicated and both
+branches are compiled independently. Conversely, after a *merge point,*
+i.e. when two control flow paths meet each other, we try to join the two
+paths in the residual code. This part is more difficult because the two
+paths may need to be compiled with different variable bindings -- e.g.
+different variables may be known to take different compile-time constant
+values in the two branches. The two paths can either be kept separate
+or merged; in the latter case, the merged compilation-time state needs
+to be a generalization (*widening*) of the two already-seen states.
+Deciding when to do each is a classical problem of partial evaluation,
+as merging too eagerly may lose important precision and not merging
+eagerly enough may create too many redundant residual code paths (to the
+point of preventing termination of the compiler).
+
+So far, we have not investigated this problem in detail. We settled for a
+simple widening heuristic: two different compile-time constants merge as
+a run-time value, but we try to preserve the richer models of run-time
+information that are enabled by the techniques described in the sequel
+(promotion (\ref{promotion}), virtual structures (\ref{virtual})...).
+This heuristic seems to work for PyPy to some extent.
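+
+As an illustrative sketch of this heuristic (hypothetical code, not
+our actual implementation), merging the bindings of a single variable
+from two compilation states could be done like this:
+
+\begin{verbatim}
+    def merge_boxes(box1, box2):
+        if box1.is_constant() and box2.is_constant():
+            if box1.getvalue() == box2.getvalue():
+                return box1   # still a compile-time constant
+        # two different constants (or any run-time values)
+        # widen into a plain run-time value
+        return RuntimeBox()
+\end{verbatim}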
+
+\subsubsection*{Calls and inlining}
+
+For calls, timeshifting can either produce code to generate a residual
+call operation or recursively invoke the timeshifted version of the
+callee. The residual operations generated by the timeshifted callee
+will grow the residual function being produced at compile-time; this
+effectively amounts to the compile-time inlining of the original callee
+into its caller. This is the default behaviour for calls within the
+user-controlled subset of original graphs of the interpreter that are
+timeshifted. Inlining only stops at re-entrant calls to the
+interpreter main loop; the net result is that at the level of the
+interpreted language, each function (or method) gets compiled into
+a single piece of residual code.
+
+\subsection{Promotion}
+\label{promotion}
+
+In the sequel, we describe in more detail one of the main new
+techniques introduced in our approach, which we call *promotion*. In
+short, it allows an arbitrary run-time value to be turned into a
+compile-time value at any point in time. Each promotion point is
+explicitly defined with a hint that must be put in the source code of
+the interpreter.
+
+From a partial evaluation point of view, promotion is the converse of
+the operation generally known as "lift". Lifting a value means
+copying a variable whose binding time is compile-time into a variable
+whose binding time is run-time -- it corresponds to the compiler
+"forgetting" a particular value that it knew about. By contrast,
+promotion is a way for the compiler to gain *more* information about
+the run-time execution of a program. Clearly, this requires
+fine-grained feedback from run-time to compile-time, thus a
+dynamic setting.
+
+Promotion requires interleaving compile-time and run-time phases,
+otherwise the compiler can only use information that is known ahead of
+time. It is impossible in the "classical" approaches to partial
+evaluation, in which the compiler always runs fully ahead of execution.
+This is a problem in many large use cases. For example, in an
+interpreter for a dynamic language, there is mostly no information
+that can be clearly and statically used by the compiler before any
+code has run.
+
+A more theoretical way to see the issue is to consider that the
+possible binding time for each variable in the interpreter is
+constrained by the binding time of the other variables it depends on.
+For some kinds of interpreters this set of constraints may have no
+interesting global solution -- if most variables can ultimately depend
+on a value, even in just one corner case, which cannot be
+compile-time, then in any solution most variables will be run-time.
+In the presence of promotion, though, these constraints can be
+occasionally violated: corner cases do not necessarily have to
+influence the common case, and local solutions can be patched
+together.
+
+A very different point of view on promotion is as a generalization of
+techniques that already exist in dynamic compilers as found in modern
+object-oriented language virtual machines. In this context feedback
+techniques are crucial for good results. The main goal is to
+optimize and reduce the overhead of dynamic dispatching and indirect
+invocation. This is achieved with variations on the technique of
+polymorphic inline caches \cite{PIC}: the dynamic lookups are cached and
+the corresponding generated machine code contains chains of
+compare-and-jump instructions which are modified at run-time. These
+techniques also allow the gathering of information to direct inlining for even
+better optimization results.
+
+In the presence of promotion, dispatch optimization can usually be
+reframed as a partial evaluation task. Indeed, if the type of the
+object being dispatched to is known at compile-time, the lookup can be
+folded, and only a (possibly inlined) direct call remains in the
+generated code. In the case where the type of the object is not known
+at compile-time, it can first be read at run-time out of the object and
+promoted to compile-time. As we will see in the sequel, this produces
+very similar machine code.\footnote{
+ This can also be seen as a generalization of a partial
+ evaluation transformation called "The Trick" (see e.g. \cite{PE}),
+ which again produces similar code but which is only
+ applicable for finite sets of values.
+}
+
+The essential advantage is that it is no longer tied to the details of
+the dispatch semantics of the language being interpreted, but applies in
+more general situations. Promotion is thus the central enabling
+primitive to make timeshifting a practical approach to language
+independent dynamic compiler generation.
+
+\subsubsection*{Promotion in practice}
+
+The implementation of promotion requires a tight coupling between
+compile-time and run-time: a *callback,* put in the generated code,
+which can invoke the compiler again. When the callback is actually
+reached at run-time, and only then, the compiler resumes and uses the
+knowledge of the actual run-time value to generate more code.
+
+The new generated code is potentially different for each run-time value
+seen. This implies that the generated code needs to contain some sort
+of updatable switch, which can pick the right code path based on the
+run-time value.
+
+While this describes the general idea, the details are open to slight
+variations. Let us show more precisely the way the JIT compilers
+produced by PyPy 1.0 work. Our first example is purely artificial:
+
+\begin{verbatim}
+ ...
+ b = a / 10
+ c = hint(b, promote=True)
+ d = c + 5
+ print d
+ ...
+\end{verbatim}
+
+In this example, ``a`` and ``b`` are run-time variables and ``c`` and
+``d`` are compile-time variables; ``b`` is copied into ``c`` via a
+promotion. The division is a run-time operation while the addition is a
+compile-time operation.
+
+The compiler derived from an interpreter containing the above code
+generates the following machine code (in pseudo-assembler notation),
+assuming that ``a`` comes from register ``r1``:
+
+\begin{verbatim}
+ ...
+ r2 = div r1, 10
+ Label1:
+ jump Label2
+ <some reserved space here>
+
+ Label2:
+ call continue_compilation(r2, <state data pointer>)
+ jump Label1
+\end{verbatim}
+
+The first time this machine code runs, the ``continue\_compilation()``
+function resumes the compiler. The two arguments to the function are
+the actual run-time value from the register ``r2``, which the compiler
+will now consider as a compile-time constant, and an immediate pointer
+to data that was generated along with the above code snippet and which
+contains enough information for the compiler to know where and with
+which state it should resume.
+
+Assuming that the first run-time value taken by ``r1`` is, say, 42, then
+the compiler will see ``r2 == 4`` and update the above machine code as
+follows:
+
+\begin{verbatim}
+ ...
+ r2 = div r1, 10
+ Label1:
+ compare r2, 4 # patched
+ jump-if-equal Label3 # patched
+ jump Label2 # patched
+ <less reserved space left>
+
+ Label2:
+ call continue_compilation(r2, <state data pointer>)
+ jump Label1
+
+ Label3: # new code
+ call print(9) # new code
+ ...
+\end{verbatim}
+
+Notice how the addition is constant-folded by the compiler. (Of course,
+in real examples, different promoted values typically make the compiler
+constant-fold complex code path choices in different ways, and not just
+simple operations.) Note also how the code following ``Label1`` is an
+updatable switch which plays the role of a polymorphic inline cache.
+The "polymorphic" terminology does not apply in our context, though, as
+the switch does not necessarily have to be on the type of an object.
+
+After the update, the original call to ``continue\_compilation()``
+returns and execution loops back to the now-patched switch at
+``Label1``. This run and all following runs in which ``r1`` is between
+40 and 49 will thus directly go to ``Label3``. Obviously, if other
+values show up, ``continue\_compilation()`` will be invoked again, so new
+code will be generated and the code at ``Label1`` further patched to
+check for more cases.
+
+If, over the course of the execution of a program, too many cases are
+seen, the reserved space after ``Label1`` will eventually run out.
+Currently, we simply reserve more space elsewhere and patch the final
+jump accordingly. There could be better strategies that we did not
+implement so far, such as discarding old code paths and reusing their slots
+in the switch, or sometimes giving up entirely and compiling a general
+version of the code in which the value remains run-time.
+
+
+\subsubsection*{Implementation notes}
+
+The *state data pointer* in the example above contains a snapshot of the
+state of the compiler when it reached the promotion point. Its memory
+impact is potentially large -- a complete continuation for each generated
+switch, which can never be reclaimed because new run-time values may
+always show up later during the execution of the program.
+
+To reduce this problem, we compress the state into a so-called *path*.
+The full state is only stored at a few specific points.\footnote{
+ More precisely, at merge points that the user needs to mark
+ as "global". The control flow join point corresponding to the
+ looping of the interpreter main loop is a typical place to put
+ such a global merge point.
+}
+The compiler
+records a trace of the multiple paths it followed from the last full
+snapshot in a lightweight tree structure. The *state data pointer* is
+then only a pointer to a node in the tree; the branch from that node to
+the root describes a path that lets the compiler quickly *replay* its
+actions (without generating code again) from the latest full snapshot to
+rebuild its internal state and get back to the original promotion point.
+
+For example, if the interpreter source code contains promotions inside a
+run-time condition:
+
+\begin{verbatim}
+ if condition:
+ ...
+ hint(x, promote=True)
+ ...
+ else:
+ ...
+ hint(y, promote=True)
+ ...
+\end{verbatim}
+
+then the tree will contain three nodes: a root node storing the
+snapshot, a child with a "True case" marker, and another child with a
+"False case" marker. Each promotion point generates a switch and a call
+to ``continue\_compilation()`` pointing to the appropriate child node.
+The compiler can re-reach the correct promotion point by following the
+markers on the branch from the root to the child.
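+
+A sketch of the corresponding data structure (hypothetical code) shows
+how lightweight the per-promotion-point state can be:
+
+\begin{verbatim}
+    class PathNode:
+        def __init__(self, parent, marker):
+            self.parent = parent   # None for the root node
+            self.marker = marker   # e.g. the "True case" marker
+
+    def replay_path(node):
+        # collect the markers from the root down to 'node', so
+        # that the compiler can replay its decisions from the
+        # latest full snapshot without generating code again
+        markers = []
+        while node.parent is not None:
+            markers.append(node.marker)
+            node = node.parent
+        markers.reverse()
+        return markers
+\end{verbatim}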
+
+
+\subsection{Virtual structures}
+\label{virtual}
+
+Interpreters for dynamic languages typically allocate a lot of small
+objects, for example due to boxing. For this reason, we
+implemented a way for the compiler to generate residual memory
+allocations as lazily as possible. The idea is to try to keep new
+run-time structures "exploded": instead of a single run-time pointer to
+a heap-allocated data structure, the structure is "virtualized" as a set
+of fresh variables, one per field. In the compiler, the variable that
+would normally contain the pointer to the structure gets instead a
+content that is neither a run-time value nor a compile-time constant,
+but a special *virtual structure* -- a compile-time data structure that
+recursively contains new variables, each of which can again store a
+run-time, a compile-time, or a virtual structure value.
+
+This approach is based on the fact that the "run-time values" carried
+around by the compiler really represent run-time locations -- the name of
+a CPU register or a position in the machine stack frame. This is the
+case for both regular variables and the fields of virtual structures.
+It means that the compilation of a ``getfield`` or ``setfield``
+operation performed on a virtual structure simply loads or stores such a
+location reference into the virtual structure; the actual value is not
+copied around at run-time.
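+
+The following sketch (hypothetical code) shows how the compiler can
+handle field operations on virtual structures without emitting any
+residual code:
+
+\begin{verbatim}
+    class VirtualStruct:
+        def __init__(self):
+            # one box per field: a run-time location, a
+            # compile-time constant, or another virtual structure
+            self.fieldboxes = {}
+
+        def op_getfield(self, name):
+            # no residual code: just return the stored box
+            return self.fieldboxes[name]
+
+        def op_setfield(self, name, box):
+            # no residual code: just move the location reference
+            self.fieldboxes[name] = box
+\end{verbatim}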
+
+It is not always possible to keep structures virtual. The main
+situation in which it needs to be "forced" (i.e. actually allocated at
+run-time) is when the pointer escapes to some non-virtual location like
+a field of a real heap structure.
+
+Virtual structures still avoid the run-time allocation of most
+short-lived objects, even in non-trivial situations. The following
+example shows a typical case. Consider the Python expression ``a+b+c``.
+Assume that ``a`` contains an integer. The PyPy Python interpreter
+implements application-level integers as boxes -- instances of a
+``W\_IntObject`` class with a single ``intval`` field. Here is the
+addition of two integers:
+
+\begin{verbatim}
+ def add(w1, w2): # w1, w2 are W_IntObject instances
+ value1 = w1.intval
+ value2 = w2.intval
+ result = value1 + value2
+ return W_IntObject(result)
+\end{verbatim}
+
+When interpreting the bytecode for ``a+b+c``, two calls to ``add()`` are
+issued; the intermediate ``W\_IntObject`` instance is built by the first
+call and thrown away after the second call. By contrast, when the
+interpreter is turned into a compiler, the construction of the
+``W\_IntObject`` object leads to a virtual structure whose ``intval``
+field directly references the register in which the run-time addition
+put its result. This location is read out of the virtual structure at
+the beginning of the second ``add()``, and the second run-time addition
+directly operates on the same register.
+
+An interesting effect of virtual structures is that they play nicely with
+promotion. Indeed, before the interpreter can call the proper ``add()``
+function for integers, it must first determine that the two arguments
+are indeed integer objects. In the corresponding dispatch logic, we
+have added two hints to promote the type of each of the two arguments.
+This produces a compiler that has the following behavior: in the general
+case, the expression ``a+b`` will generate two consecutive run-time
+switches followed by the residual code of the proper version of
+``add()``. However, in ``a+b+c``, the virtual structure representing
+the intermediate value will contain a compile-time constant as its type.
+Promoting a compile-time constant is trivial -- no run-time code is
+generated. The whole expression ``a+b+c`` thus only requires three
+switches instead of four. It is easy to see that even more switches can
+be skipped in larger examples; typically, in a tight loop manipulating
+only integers, all objects are virtual structures for the compiler and
+the residual code is theoretically optimal -- all type propagation and
+boxing/unboxing occurs at compile-time.
+
+
+\subsection{Virtualizable structures}
+\label{virtualizable}
+
+In the PyPy interpreter there are cases where structures cannot be
+virtual -- because they escape, or are allocated outside the
+JIT-generated code -- but where we would still like to keep the
+"exploding" effect and carry the fields of the structure as local
+variables in the generated code.
+
+It is likely that the same problem occurs more generally in many
+interpreters: the typical example is that of frame objects, which store
+among other things the values of the local variables of each function
+invocation. Ideally, the effect we would like to achieve is to keep the
+frame object as a purely virtual structure, and the same for the array
+or dictionary implementing the bindings of the locals. Then each local
+variable of the interpreted language can be represented as a separate
+run-time value in the generated code, or be itself further virtualized
+(e.g. as a virtual ``W\_IntObject`` structure as seen above).
+
+The issue is that the frame object is sometimes built in advance by
+non-JIT-generated code; even when it is not, it immediately escapes into
+the global list of frames that is used to support the frame stack
+introspection primitives that Python exposes. In other words, the frame
+object cannot be purely virtual because a pointer to it must be stored
+into a global data structure (even though in practice most frame
+objects are deallocated without ever having been introspected).
+
+To solve this problem, we introduced *virtualizable structures,* a mix
+between regular run-time structures and virtual structures. A
+virtualizable structure is a structure that exists at run-time in the
+heap, but that is
+simultaneously treated as virtual by the compiler. Accesses to the
+structure from the code generated by the JIT are virtualized away,
+i.e. don't involve run-time copying. The trade-off is that in order
+to keep both views synchronized, accesses to the run-time structure
+from regular code not produced by the JIT need to perform an extra
+check.
+
+Because of this trade-off, a hint needs to be inserted manually to mark
+the classes whose instances should be implemented in this way -- the
+class of frame objects, in the case of PyPy. The hint is used by the
+translation toolchain to add a hidden field to all frame objects, and to
+translate all accesses to the object fields into low-level code that
+first checks the hidden field. This is the only case so far in which
+the presence of the JIT compiler imposes a global change to the rest of
+the program during translation.\footnote{
+ This is not a problem per se, as it is anyway just a small
+ extension to the translation framework, but it imposes a performance
+ overhead to all code manipulating frame objects. To mitigate this, we
+ added a way to declare during RPython type inference that the
+ indirection check is not needed in some parts of the code where we know
+ that the frame object cannot have a virtual counterpart.
+}
+
+The hidden field is set when the frame structure enters JIT-generated
+code, and cleared when it leaves. When a recursive call to
+non-JIT-generated code finds a structure with the field set, it invokes
+a JIT-generated callback to perform the reading or updating of the field
+from the point of view of its virtual structure representation. The
+actual fields in the heap structure are not used during this time.
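+
+In pseudo-code, an access to a local variable of a frame from regular,
+non-JIT-generated code then becomes something like the following
+sketch (the names ``vable_token`` and ``jit_read_local`` are
+illustrative; the actual low-level code is inserted automatically by
+the translation toolchain):
+
+\begin{verbatim}
+    def get_local(frame, n):
+        if frame.vable_token is not None:   # hidden field
+            # the frame is currently virtualized in JIT code:
+            # ask a JIT-generated callback for the real value
+            return jit_read_local(frame.vable_token, n)
+        return frame.locals[n]   # normal heap access
+\end{verbatim}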
+
+The effect that can be obtained in this way is that although frame
+objects are still allocated in the heap, most of them will always remain
+essentially empty. A pointer to these empty frames is pushed into and
+popped off the global frame list, allowing the introspection mechanisms
+to still work perfectly.
+
+
+\subsection{Other implementation details}
+
+We quickly mention below a few other features and implementation
+details of the JIT generation framework. More information can be
+found in the on-line documentation.
+
+* There are more user-specified hints available, like *deep-freezing,*
+ which marks an object as immutable in order to allow accesses to
+ its content to be constant-folded at compile-time.
+
+* The compiler representation of a run-time value for a non-virtual
+ structure may additionally remember that some fields are actually
+ compile-time constants. This occurs for example when a field is
+ read from the structure at run-time and then promoted to compile-time.
+
+* In addition to virtual structures, lists and dictionaries can also be
+ virtual.
+
+* Exception handling is achieved by inserting explicit operations into
+ the graphs before they are timeshifted. Most of these run-time
+ exception manipulations are then virtualized away, by treating the
+ exception state as virtual.
+
+* Timeshifting is performed in two phases: a first step transforms the
+ graphs by updating their control flow and inserting pseudo-operations
+ to drive the compiler; a second step (based on the RTyper \cite{D05.1})
+ replaces all necessary operations by calls to support code.
+
+* The support code implements the generic behaviour of the compiler,
+ e.g. the merge logic. It is about 3500 lines of RPython code. The
+ rest of the hint-annotator and timeshifter is about 3800 lines of
+ Python code.
+
+* The machine code backends (two so far, Intel IA32 and PowerPC) are
+ about 3500 further lines of RPython code each. There is a
+ well-defined interface between the JIT compiler support code and the
+ backends, making writing new backends relatively easy. The unusual
+ part of the interface is the support for the run-time updatable
+ switches.
+
+
+\subsection{Open issues}
+
+Here are what we think are the most important points that will need
+attention in order to make the approach more robust:
+
+* The timeshifted graphs currently compile many branches eagerly. This
+ can easily result in residual code explosion. Depending on the source
+ interpreter this can also result in non-termination issues, where
+ compilation never completes. The opposite extreme would be to always
+ compile branches lazily, when they are about to be executed, as Psyco
+ does. While this neatly sidesteps termination issues, the best
+ solution is probably something in between these extremes.
+
+* As described in the Promotion section (\ref{promotion}),
+ we need fall-back solutions for when the
+ number of promoted run-time values seen at a particular point becomes
+ too large.
+
+* We need more flexible control over what to inline or not to inline in
+ the residual code.
+
+* The widening heuristic for merging needs to be refined.
+
+* The JIT generation framework needs to be made aware of some other
+ translation-time aspects \cite{D05.4} \cite{D07.1} in order to produce the
+ correct residual code (e.g. code calling the correct Garbage
+ Collection routines or supporting Stackless-style stack unwinding).
+
+* We have not yet worked on profile-directed identification of program hot
+ spots. Currently, the interpreter must decide when to invoke the JIT
+ or not (which can itself be based on explicit requests from the interpreted
+ program).
+
+* The machine code backends can be improved.
+
+The latter point opens an interesting future research direction: can we
+layer our kind of JIT compiler on top of a virtual machine that already
+contains a lower-level JIT compiler? In other words, can we delegate
+the difficult questions of machine code generation -- e.g. inlining and
+re-optimization of frequently executed code -- to a lower, independent
+layer? What changes would be required to an existing virtual
+machine, e.g. a Java Virtual Machine, to support this?
+
+
+\section{Results}
+
+The following test function is an example of purely arithmetic code
+written in Python, which the PyPy JIT can run extremely fast:
+
+\begin{verbatim}
+ def f1(n):
+ "Arbitrary test function."
+ i = 0
+ x = 1
+ while i<n:
+ j = 0
+ while j<=i:
+ j = j + 1
+ x = x + (i&j)
+ i = i + 1
+ return x
+\end{verbatim}
+
+We measured the time required to compute ``f1(2117)`` with the
+following implementations:
+
+* Python 2.4.4, the standard CPython implementation.
+
+* A version of pypy-c including a generated JIT compiler.
+
+* gcc 4.1.1 compiling the above function rewritten in C (which, unlike
+ the other two, does not do any overflow checking on the arithmetic
+ operations).
+
+The relative results have been found to vary by 25\% depending on the
+machine. On our reference benchmark machine, a 4-core Intel(R)
+Xeon(TM) CPU 3.20GHz with 5GB of RAM, we obtained the following results
+(the numbers in parentheses are the slow-down ratios relative to the
+unoptimized gcc compilation):
+
++-----------------------------------------+------------------+
+| Interpreter | Seconds per call |
++=========================================+==================+
+| Python 2.4.4 | 0.82 (132x) |
++-----------------------------------------+------------------+
+| Python 2.4.4 with Psyco 1.5.2 | 0.0062 (1.00x) |
++-----------------------------------------+------------------+
+| pypy-c with the JIT turned off | 1.77 (285x) |
++-----------------------------------------+------------------+
+| pypy-c with the JIT turned on | 0.0091 (1.47x) |
++-----------------------------------------+------------------+
+| gcc | 0.0062 (1x) |
++-----------------------------------------+------------------+
+| gcc -O2 | 0.0022 (0.35x) |
++-----------------------------------------+------------------+
+
+
+This table shows that the PyPy JIT is able to generate residual code
+that runs within the same order of magnitude as an unoptimizing gcc. It
+shows that all the abstraction overhead has been correctly removed from
+the residual code; the remaining slow-downs are only due to a suboptimal
+low-level machine code generation backend. We have thus reached our
+goal of automatically generating a JIT whose performance is similar to
+the hand-written Psyco without having its limitations.\footnote{
+ As mentioned above, Psyco gives up compiling Python functions
+ if they use constructs it does not support, and is not 100\%
+ compatible with introspection of frames. By construction the
+ PyPy JIT does not have these limitations. The PyPy JIT is
+ also easier to retarget, and already supports more platforms
+ than Psyco does, namely Intel-based and PowerPC-based Mac OS X.
+}
+
+In particular, the ratio of 1.47x between the unoptimizing gcc and the
+PyPy JIT matches the target of 1.5x that we set ourselves as our goal
+within the duration of the EU project. We should also mention that on
+an Intel-based Mac OS X machine we have measured this ratio to be as low
+as 1.15x.
+
+.. comment
+
+ python 2.4: 0.82
+ pypy-c-42412-allworking-tproxy: 1.90392203331
+ pypy-c-41802-faassen: 1.60267777443
+ pypy-c-41992-jit, turned off: 1.76742062569
+ pypy-c-41992-jit: 0.00912307977676 <===
+ psyco 1.5.2: 0.00621596813202
+ gcc -O0: 0.0061992
+ gcc -O2: 0.0021689
+
+
+\section{Conclusion}
+
+Producing the results described in the previous section requires the
+generated compiler to completely remove the abstraction overhead and to
+fold at compile-time some rather involved lookup algorithms like
+Python's binary operation dispatch. Promotion proved sufficiently
+powerful to achieve this. Other features we introduced allowed us to
+preserve information about intermediate value types, to avoid their
+boxing and to propagate them through the CPU stack and registers.
+
+Some slight reorganisation of the interpreter main loop, with no
+influence on its semantics, marking the frames as virtualizable
+(\ref{virtualizable}), and adding hints at a few crucial points was
+all that was necessary for our Python interpreter.
+
+We think that our results make viable an approach to implementing
+dynamic languages in which only a straightforward bytecode interpreter
+needs to be written. Dynamic compilers would be generated
+automatically, guided by the placement of hints.
+
+Such implementations should stay flexible and evolvable. The dynamic
+compilers would be robust against language changes, up to the need to
+maintain and possibly change the hints.
+
+We consider this a major breakthrough in terms of the possibilities
+it opens for language design and implementation; it was one of the
+main goals of the research program within the PyPy project.
+
+
+%.. References (title not necessary, latex generates it)
+%
+%.. [D05.1] `Compiling Dynamic Language Implementations`, PyPy EU-Report, 2005
+%
+%.. [D05.4] `Encapsulating Low-Level Aspects`, PyPy EU-Report, 2005
+%
+%.. [D07.1] `Support for Massive Parallelism, Optimisation results, Practical
+% Usages and Approaches for Translation Aspects`, PyPy EU-Report,
+% 2006
+%
+%.. [D08.1] `Release a JIT Compiler for PyPy Including Processor Backends
+% for Intel and PowerPC`, PyPy EU-Report, 2007
+%
+%.. [FU] `Partial evaluation of computation process -- an approach to a
+% compiler-compiler`, Yoshihito Futamura, Higher-Order and
+% Symbolic Computation, 12(4):363-397, 1999. Reprinted from
+% Systems Computers Controls 2(5), 1971
+%
+%.. [PE] `Partial evaluation and automatic program generation`,
+% Neil D. Jones, Carsten K. Gomard, Peter Sestoft,
+% Prentice-Hall, Inc., Upper Saddle River, NJ, 1993
+%
+%.. [REJIT] `Retargeting JIT Compilers by using C-Compiler Generated
+% Executable Code`, M. Anton Ertl, David Gregg, Proc. of
+% the 13th Intl. Conf. on Parallel Architectures and
+% Compilation Techniques, 2004.
+%
+%.. [PIC] `Optimizing Dynamically-Typed Object-Oriented Languages With
+% Polymorphic Inline Caches`, U. Hölzle, C. Chambers, D. Ungar,
+% ECOOP'91 Conference Proceedings, Geneva, 1991.
+%
+%.. [PSYCO] `Representation-based just-in-time specialization and the
+% psyco prototype for python`, Armin Rigo, in PEPM '04: Proceedings
+% of the 2004 ACM SIGPLAN symposium on Partial evaluation and
+% semantics-based program manipulation, pp. 15-26, ACM Press, 2004
+%
+%.. [VMCDLS] `PyPy's approach to virtual machine construction`, Armin Rigo,
+% Samuele Pedroni, in OOPSLA '06: Companion to the 21st ACM SIGPLAN
+% conference on Object-oriented programming languages, systems, and
+% applications, pp. 944-953, ACM Press, 2006
+
+\end{document}