[pypy-commit] pypy default: Check in extradoc/talk/bucharest2016/jit-backend-8vhY1ArTsh.txt in the

Wed Sep 21 16:24:40 EDT 2016

Author: Armin Rigo <arigo at tunes.org>
Branch: 
Changeset: r87292:1e5c1b905a1c
Date: 2016-09-21 22:24 +0200
http://bitbucket.org/pypy/pypy/changeset/1e5c1b905a1c/

Log:	Check in extradoc/talk/bucharest2016/jit-backend-8vhY1ArTsh.txt in
	the docs, as a draft description of our low-level JIT assembler
	backends

diff --git a/rpython/doc/jit/backend.rst b/rpython/doc/jit/backend.rst
new file mode 100644
--- /dev/null
+++ b/rpython/doc/jit/backend.rst
@@ -0,0 +1,263 @@
+=========================
+PyPy's assembler backends
+=========================
+
+Draft notes about the organization of assembler backends in the PyPy JIT, in 2016
+=================================================================================
+
+
+input: linear sequence of instructions, called a "trace".
+
+A trace is a sequence of instructions in SSA form.  Most instructions
+correspond to one or a few CPU-level instructions.  There are a few
+meta-instructions like `label` and debugging stuff.  All branching is
+done with guards, which are instructions that check that a condition is
+true and exit the trace if not.  A failing guard can have a new trace
+added to it later, called a "bridge".  A patched guard becomes a direct
+`Jcond` instruction going to the bridge, with no indirection, no
+register spilling, etc.
+
+A trace ends with either a `return` or a `jump to label`.  The target
+label is either inside the same trace, or in some older one.  For
+historical reasons we call a "loop" a trace that is not a bridge.  The
+machine code that we generate is organized as a forest of trees; the
+trunk of the tree is a "loop", and the branches are all bridges
+(branching off the trunk or off another branch).
+
+* every trunk or branch that ends in a `jump to label` can target a
+  label from a different tree, too.
+
+* the whole process of assembling a loop or a branch is basically
+  single-threaded, so no synchronization issue there (including to patch
+  older generated instructions).
+
+* the generated assembler has got a "frame" in %rbp, which is actually
+  not on the stack at all, but is a GC object (called a "jitframe").
+  Spilling goes there.
+
+* the guards are `Jcond` to a very small piece of generated code, which
+  is basically pushing a couple of constants on the stack and then
+  jumping to the general guard-recovery code.  That code will save the
+  registers into the jitframe and then exit the whole generated
+  function.  The caller of that generated function checks how it
+  finished: if it finished by hitting a guard, then the caller is
+  responsible for calling the "blackhole interpreter".  This is the part
+  of the front-end that recovers from failing guards and finishes
+  running the frame (including, possibly, by jumping again into
+  generated assembler).
+
+
+Details about the JITting process:
+
+* front-end and optimization pass
+
+* rewrite (includes gc related transformation as well as simplifactions)
+
+* assembler generation
+
+
+Front-end and optimization pass
+-------------------------------
+
+Not discussed here in detail.  This produces loops and bridges using an
+instruction set that is "high-level" in some sense: it contains
+intructions like "new"/"new_array", and
+"setfield"/"setarrayitem"/"setinteriorfield" which describe the action
+of storing a value in a precise field of the structure or array.  For
+example, the "setfield" action might require implicitly a GC write
+barrier.  This is the high-level trace that we send to the following
+step.
+
+
+Rewrite
+-------
+
+A mostly but not completely CPU-independent phase: lowers some
+instructions.  For example, the variants of "new" are lowered to
+"malloc" and a few "gc_store": it bumps the pointer of the GC and then
+sets a few fields explicitly in the newly allocated structure.  The
+"setfield" is replaced with a "cond_gc_wb_call" (conditional call to the
+write barrier) if needed, followed by a "gc_store".
+
+The "gc_store" instruction can be encoded in a single MOV assembler
+instruction, but is not as flexible as a MOV.  The address is always
+specified as "some GC pointer + an offset".  We don't have the notion of
+interior pointer for GC objects.
+
+A different instruction, "gc_store_indexed", offers additional operands,
+which can be mapped to a single MOV instruction using forms like
+`[rax+8*rcx+24]`.
+
+Some other complex instructions pass through to the backend, which must
+deal with them: for example, "card marking" in the GC.  (Writing an
+object pointer inside an array would require walking the whole array
+later to find "young" references. Instead of that, we flip a bit for
+every range of 128 entries.  This is a common GC optimization.)  Setting
+the card bit of a GC object requires a sequence of assembler
+instructions that depends too much on the target CPU to be expressed
+explicitly here (moreover, it contains a few branches, which are hard to
+express at this level).
+
+
+Assembly
+--------
+
+No fancy code generation technique, but greedy forward pass that tries
+to avoid some pitfalls
+
+
+Handling instructions
+~~~~~~~~~~~~~~~~~~~~~
+
+* One by one (forward direction).   Each instruction asks the register
+  allocator to ensure that some arguments are in registers (not in the
+  jitframe); asks for a register to put its result into; and asks for
+  additional scratch registers that will be freed at  the end of the
+  instruction.  There is a special case for boolean variables: they are
+  stored in the condition code flags instead of being materialized as a
+  0/1 value.  (They are materialized later, except in the common case
+  where they are only used by the next `guard_false` or `guard_true` and
+  then forgotten.)
+
+* Instruction arguments are loaded into a register on demand.  This
+  makes the backend quite easy to write, but leads do some bad
+  decisions.
+
+
+Linear scan register allocation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Although it's always a linear trace that we consider, we don't use
+advanced techniques for register allocation: we do forward, on-demand
+allocation as the backend produces the assembler.  When it asks for a
+register to put some value into, we give it any free register, without
+consideration for what will be done with it later.  We compute the
+longevity of all variables, but only use it when choosing which register
+to spill (we spill the variable with the longest longevity).
+
+This works to some extend because it is well integrated with the earlier
+optimization pass. Loops are unrolled once by the optimization pass to
+allow more powerful optimizations---the optimization pass itself is the
+place that benefits the most, but it also has benefits here in the
+assembly pass.  These are:
+
+* The first peeling initializes the register binding on the first use.
+
+* This leads to an already allocated register of the trace loop.
+
+* As well as allocated registers when exiting bridges
+
+[Try to better allocate registers to match the ABI (minor to non benefit
+in the current state)]
+
+
+More complex mappings
+~~~~~~~~~~~~~~~~~~~~~
+
+Some instructions generate more complex code.  These are either or both of:
+
+* complex instructions generating some local control flow, like
+  "cond_gc_wb_call" (for write barriers), "call_assembler" (a call
+  followed by a few checks).
+
+* instructions that invoke custom assembler helpers, like the slow-path
+  of write barriers or the slow-path of allocations.  These slow-paths
+  are typically generated too, so that we are not constrained by the
+  usual calling conventions.
+
+
+GC pointers
+~~~~~~~~~~~
+
+Around most CALL instructions, we need to record a description of where
+the GC pointers are (registers and stack frame).  This is needed in case
+the CALL invokes a garbage collection.  The GC pointers can move; the
+pointers in the registers and stack frame are updated by the GC.  That's
+a reason for why we don't have explicit interior pointers.
+
+GC pointers can appear as constants in the trace.  We are busy changing
+that to use a constant table and `MOV REG, (%RIP+offset)`.  The
+"constant" in the table is actually updated by the GC if the object
+move.
+
+
+Vectorization
+~~~~~~~~~~~~~
+
+Optimization developed to use SIMD instructions for trace loops. Primary
+idea was to use it as an optimization of micro numpy. It has several
+passes on the already optimized trace.
+
+Shortly explained: It builds dependencies for an unrolled trace loop,
+gathering pairs/packs of operations that could be executed in parallel
+and finally schedules the operations.
+
+What did it add to the code base:
+
+* Dependencies can be constructed
+
+* Code motion of guards to relax dependencies
+
+* Scheduler to reorder trace
+
+* Array bound check removal (especially for unrolled traces)
+
+What can it do:
+
+* Transform vector loops (element wise operations)
+
+* Accumulation (`reduce([...],operator,0)`). Requires Operation to be
+  associative and commutative
+
+* SSE 4.1 as "vector backend"
+
+
+We do not
+~~~~~~~~~
+
+* Keep tracing data around to reoptimize the trace tree. (Once a trace
+  is compiled, minimal data is kept.)  This is one reason (there are
+  others in the front-end) for the following result: JIT-compiling a
+  small loop with two common paths ends up as one "loop" and one bridge
+  assembled, and the bridge-following path is slightly less efficient.
+  This is notably because this bridge is assembled with two constraints:
+  the input registers are fixed (from the guard), and the output
+  registers are fixed (from the jump target); usually these two sets of
+  fixed registers are different, and copying around is needed.
+
+* We don't join trace tails: we only assemble *trees*.
+
+* We don't do any reordering (neither of trace instructions nor of
+  individual assembler instructions)
+
+* We don't do any cross-instruction optimization that makes sense only
+  for the backend and can't easily be expressed at a higher level.  I'm
+  sure there are tons of examples of that, but e.g. loading a large
+  constant in a register that will survive for several instructions;
+  moving out of loops *parts* of some instruction like the address
+  calculation; etc. etc.
+
+* Other optimization opportunities I can think about: look at the
+  function prologue/epilogue; look at the overhead (small but not zero)
+  at the start of a bridge.  Also check if the way guards are
+  implemented makes sense.  Also, we generate large-ish sequences of
+  assembler instructions with tons of `Jcond` that are almost never
+  followed; any optimization opportunity there?  (They all go forward,
+  if it changes anything.)  In theory we could also replace some of
+  these with a signal handler on segfault (e.g. `guard_nonnull_class`).
+
+
+a GCC or LLVM backend?
+~~~~~~~~~~~~~~~~~~~~~~
+
+At least for comparison we'd like a JIT backend that emits its code
+using GCC or LLVM (irrespective of the time it would take).  But it's
+hard to map reasonably well the guards to the C language or to LLVM IR.
+The problems are: (1) we have many guards, we would like to avoid having
+many paths that each do a full
+saving-all-local-variables-that-are-still-alive; (2) it's hard to patch
+a guard when a bridge is compiled from it; (3) instructions like a CALL
+need to expose the local variables that are GC pointers; CALL_MAY_FORCE
+need to expose *all* local variables for optional off-line
+reconstruction of the interpreter state.
+
diff --git a/rpython/doc/jit/index.rst b/rpython/doc/jit/index.rst
--- a/rpython/doc/jit/index.rst
+++ b/rpython/doc/jit/index.rst
@@ -26,6 +26,7 @@
    optimizer
    virtualizable
    vectorization
+   backend
 
 - :doc:`Overview <overview>`: motivating our approach
 
@@ -34,5 +35,8 @@
 - :doc:`Optimizer <optimizer>`: the step between tracing and writing
   machine code
 
-- :doc:`Virtulizable <virtualizable>` how virtualizables work and what they are
-  (in other words how to make frames more efficient).
+- :doc:`Virtualizable <virtualizable>`: how virtualizables work and what
+  they are (in other words how to make frames more efficient).
+
+- :doc:`Assembler backend <backend>`: draft notes about the organization
+  of the assembler backends