[Python-checkins] devguide: Move PEP 336 to here.

Tue Jan 18 01:38:25 CET 2011

brett.cannon pushed de7324fef53a to devguide:

http://hg.python.org/devguide/rev/de7324fef53a
changeset:   116:de7324fef53a
user:        Brett Cannon <brett at python.org>
date:        Mon Jan 17 16:35:14 2011 -0800
summary:
  Move PEP 336 to here.

files:
  compiler.rst
  index.rst

diff --git a/compiler.rst b/compiler.rst
new file mode 100644
--- /dev/null
+++ b/compiler.rst
@@ -0,0 +1,562 @@
+.. _compiler:
+
+Design of CPython's Compiler
+============================
+
+
+Abstract
+--------
+
+Historically (through 2.4), compilation from source code to bytecode
+involved two steps:
+
+1. Parse the source code into a parse tree (Parser/pgen.c)
+2. Emit bytecode based on the parse tree (Python/compile.c)
+
+Historically, this is not how a standard compiler works.  The usual
+steps for compilation are:
+
+1. Parse source code into a parse tree (Parser/pgen.c)
+2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
+3. Transform AST into a Control Flow Graph (Python/compile.c)
+4. Emit bytecode based on the Control Flow Graph (Python/compile.c)
+
+Starting with Python 2.5, the above steps are now used.  This change
+was done to simplify compilation by breaking it into three steps.
+The purpose of this document is to outline how the latter three steps
+of the process works.
+
+This document does not touch on how parsing works beyond what is needed
+to explain what is needed for compilation.  It is also not exhaustive
+in terms of the how the entire system works.  You will most likely need
+to read some source to have an exact understanding of all details.
+
+
+Parse Trees
+-----------
+
+Python's parser is an LL(1) parser mostly based off of the
+implementation laid out in the Dragon Book [Aho86]_.
+
+The grammar file for Python can be found in Grammar/Grammar with the
+numeric value of grammar rules are stored in Include/graminit.h.  The
+numeric values for types of tokens (literal tokens, such as ``:``,
+numbers, etc.) are kept in Include/token.h).  The parse tree made up of
+``node *`` structs (as defined in Include/node.h).
+
+Querying data from the node structs can be done with the following
+macros (which are all defined in Include/token.h):
+
+- ``CHILD(node *, int)``
+	Returns the nth child of the node using zero-offset indexing
+- ``RCHILD(node *, int)``
+	Returns the nth child of the node from the right side; use
+	negative numbers!
+- ``NCH(node *)``
+	Number of children the node has
+- ``STR(node *)``
+	String representation of the node; e.g., will return ``:`` for a
+	COLON token
+- ``TYPE(node *)``
+	The type of node as specified in ``Include/graminit.h``
+- ``REQ(node *, TYPE)``
+	Assert that the node is the type that is expected
+- ``LINENO(node *)``
+	retrieve the line number of the source code that led to the
+	creation of the parse rule; defined in Python/ast.c
+
+To tie all of this example, consider the rule for 'while'::
+
+  while_stmt: 'while' test ':' suite ['else' ':' suite]
+
+The node representing this will have ``TYPE(node) == while_stmt`` and
+the number of children can be 4 or 7 depending on if there is an 'else'
+statement.  To access what should be the first ':' and require it be an
+actual ':' token, `(REQ(CHILD(node, 2), COLON)``.
+
+
+Abstract Syntax Trees (AST)
+---------------------------
+
+The abstract syntax tree (AST) is a high-level representation of the
+program structure without the necessity of containing the source code;
+it can be thought of as an abstract representation of the source code.  The
+specification of the AST nodes is specified using the Zephyr Abstract
+Syntax Definition Language (ASDL) [Wang97]_.
+
+The definition of the AST nodes for Python is found in the file
+Parser/Python.asdl .
+
+Each AST node (representing statements, expressions, and several
+specialized types, like list comprehensions and exception handlers) is
+defined by the ASDL.  Most definitions in the AST correspond to a
+particular source construct, such as an 'if' statement or an attribute
+lookup.  The definition is independent of its realization in any
+particular programming language.
+
+The following fragment of the Python ASDL construct demonstrates the
+approach and syntax::
+
+  module Python
+  {
+	stmt = FunctionDef(identifier name, arguments args, stmt* body,
+			    expr* decorators)
+	      | Return(expr? value) | Yield(expr value)
+	      attributes (int lineno)
+  }
+
+The preceding example describes three different kinds of statements;
+function definitions, return statements, and yield statements.  All
+three kinds are considered of type stmt as shown by '|' separating the
+various kinds.  They all take arguments of various kinds and amounts.
+
+Modifiers on the argument type specify the number of values needed; '?'
+means it is optional, '*' means 0 or more, no modifier means only one
+value for the argument and it is required.  FunctionDef, for instance,
+takes an identifier for the name, 'arguments' for args, zero or more
+stmt arguments for 'body', and zero or more expr arguments for
+'decorators'.
+
+Do notice that something like 'arguments', which is a node type, is
+represented as a single AST node and not as a sequence of nodes as with
+stmt as one might expect.  
+
+All three kinds also have an 'attributes' argument; this is shown by the
+fact that 'attributes' lacks a '|' before it.
+
+The statement definitions above generate the following C structure type::
+
+  typedef struct _stmt *stmt_ty;
+
+  struct _stmt {
+        enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
+        union {
+                struct {
+                        identifier name;
+                        arguments_ty args;
+                        asdl_seq *body;
+                } FunctionDef;
+                
+                struct {
+                        expr_ty value;
+                } Return;
+                
+                struct {
+                        expr_ty value;
+                } Yield;
+        } v;
+        int lineno;
+   }
+
+Also generated are a series of constructor functions that allocate (in
+this case) a stmt_ty struct with the appropriate initialization.  The
+'kind' field specifies which component of the union is initialized.  The
+FunctionDef() constructor function sets 'kind' to FunctionDef_kind and
+initializes the 'name', 'args', 'body', and 'attributes' fields.
+
+
+Memory Management
+-----------------
+
+Before discussing the actual implementation of the compiler, a discussion of
+how memory is handled is in order.  To make memory management simple, an arena
+is used.  This means that a memory is pooled in a single location for easy
+allocation and removal.  What this gives us is the removal of explicit memory
+deallocation.  Because memory allocation for all needed memory in the compiler
+registers that memory with the arena, a single call to free the arena is all
+that is needed to completely free all memory used by the compiler.
+
+In general, unless you are working on the critical core of the compiler, memory
+management can be completely ignored.  But if you are working at either the
+very beginning of the compiler or the end, you need to care about how the arena
+works.  All code relating to the arena is in either Include/pyarena.h or
+Python/pyarena.c .
+
+PyArena_New() will create a new arena.  The returned PyArena structure will
+store pointers to all memory given to it.  This does the bookkeeping of what
+memory needs to be freed when the compiler is finished with the memory it used.
+That freeing is done with PyArena_Free().  This needs to only be called in
+strategic areas where the compiler exits.
+
+As stated above, in general you should not have to worry about memory
+management when working on the compiler.  The technical details have been
+designed to be hidden from you for most cases.
+
+The only exception comes about when managing a PyObject.  Since the rest
+of Python uses reference counting, there is extra support added
+to the arena to cleanup each PyObject that was allocated.  These cases
+are very rare.  However, if you've allocated a PyObject, you must tell
+the arena about it by calling PyArena_AddPyObject().
+
+
+Parse Tree to AST
+-----------------
+
+The AST is generated from the parse tree (see Python/ast.c) using the
+function ``PyAST_FromNode()``.
+
+The function begins a tree walk of the parse tree, creating various AST
+nodes as it goes along.  It does this by allocating all new nodes it
+needs, calling the proper AST node creation functions for any required
+supporting functions, and connecting them as needed.
+
+Do realize that there is no automated nor symbolic connection between
+the grammar specification and the nodes in the parse tree.  No help is
+directly provided by the parse tree as in yacc.
+
+For instance, one must keep track of which node in the parse tree
+one is working with (e.g., if you are working with an 'if' statement
+you need to watch out for the ':' token to find the end of the conditional).
+
+The functions called to generate AST nodes from the parse tree all have
+the name ast_for_xx where xx is what the grammar rule that the function
+handles (alias_for_import_name is the exception to this).  These in turn
+call the constructor functions as defined by the ASDL grammar and
+contained in Python/Python-ast.c (which was generated by
+Parser/asdl_c.py) to create the nodes of the AST.  This all leads to a
+sequence of AST nodes stored in asdl_seq structs.
+
+
+Function and macros for creating and using ``asdl_seq *`` types as found
+in Python/asdl.c and Include/asdl.h:
+
+- ``asdl_seq_new()``
+	Allocate memory for an asdl_seq for the specified length
+- ``asdl_seq_GET()``
+	Get item held at a specific position in an asdl_seq
+- ``asdl_seq_SET()``
+	Set a specific index in an asdl_seq to the specified value
+- ``asdl_seq_LEN(asdl_seq *)``
+	Return the length of an asdl_seq
+
+If you are working with statements, you must also worry about keeping
+track of what line number generated the statement.  Currently the line
+number is passed as the last parameter to each stmt_ty function.
+
+
+Control Flow Graphs
+-------------------
+
+A control flow graph (often referenced by its acronym, CFG) is a
+directed graph that models the flow of a program using basic blocks that
+contain the intermediate representation (abbreviated "IR", and in this
+case is Python bytecode) within the blocks.  Basic blocks themselves are
+a block of IR that has a single entry point but possibly multiple exit
+points.  The single entry point is the key to basic blocks; it all has
+to do with jumps.  An entry point is the target of something that
+changes control flow (such as a function call or a jump) while exit
+points are instructions that would change the flow of the program (such
+as jumps and 'return' statements).  What this means is that a basic
+block is a chunk of code that starts at the entry point and runs to an
+exit point or the end of the block.
+  
+As an example, consider an 'if' statement with an 'else' block.  The
+guard on the 'if' is a basic block which is pointed to by the basic
+block containing the code leading to the 'if' statement.  The 'if'
+statement block contains jumps (which are exit points) to the true body
+of the 'if' and the 'else' body (which may be NULL), each of which are
+their own basic blocks.  Both of those blocks in turn point to the
+basic block representing the code following the entire 'if' statement.
+
+CFGs are usually one step away from final code output.  Code is directly
+generated from the basic blocks (with jump targets adjusted based on the
+output order) by doing a post-order depth-first search on the CFG
+following the edges.
+
+
+AST to CFG to Bytecode
+----------------------
+
+With the AST created, the next step is to create the CFG. The first step
+is to convert the AST to Python bytecode without having jump targets
+resolved to specific offsets (this is calculated when the CFG goes to
+final bytecode). Essentially, this transforms the AST into Python
+bytecode with control flow represented by the edges of the CFG.
+
+Conversion is done in two passes.  The first creates the namespace
+(variables can be classified as local, free/cell for closures, or
+global).  With that done, the second pass essentially flattens the CFG
+into a list and calculates jump offsets for final output of bytecode.
+
+The conversion process is initiated by a call to the function
+``PyAST_Compile()`` in Python/compile.c .  This function does both the
+conversion of the AST to a CFG and
+outputting final bytecode from the CFG.  The AST to CFG step is handled
+mostly by two functions called by PyAST_Compile(); PySymtable_Build() and
+compiler_mod() .  The former is in Python/symtable.c while the latter is in
+Python/compile.c .
+
+PySymtable_Build() begins by entering the starting code block for the
+AST (passed-in) and then calling the proper symtable_visit_xx function
+(with xx being the AST node type).  Next, the AST tree is walked with
+the various code blocks that delineate the reach of a local variable
+as blocks are entered and exited using symtable_enter_block() and
+symtable_exit_block(), respectively.
+
+Once the symbol table is created, it is time for CFG creation, whose
+code is in Python/compile.c .  This is handled by several functions
+that break the task down by various AST node types.  The functions are
+all named compiler_visit_xx where xx is the name of the node type (such
+as stmt, expr, etc.).  Each function receives a ``struct compiler *``
+and xx_ty where xx is the AST node type.  Typically these functions
+consist of a large 'switch' statement, branching based on the kind of
+node type passed to it.  Simple things are handled inline in the
+'switch' statement with more complex transformations farmed out to other
+functions named compiler_xx with xx being a descriptive name of what is
+being handled.
+
+When transforming an arbitrary AST node, use the VISIT() macro.
+The appropriate compiler_visit_xx function is called, based on the value
+passed in for <node type> (so ``VISIT(c, expr, node)`` calls
+``compiler_visit_expr(c, node)``).  The VISIT_SEQ macro is very similar,
+but is called on AST node sequences (those values that were created as
+arguments to a node that used the '*' modifier).  There is also
+VISIT_SLICE() just for handling slices.
+
+Emission of bytecode is handled by the following macros:
+
+- ``ADDOP()``
+    add a specified opcode
+- ``ADDOP_I()``
+    add an opcode that takes an argument
+- ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)``
+    add an opcode with the proper argument based on the position of the
+    specified PyObject in PyObject sequence object, but with no handling of
+    mangled names; used for when you
+    need to do named lookups of objects such as globals, consts, or
+    parameters where name mangling is not possible and the scope of the
+    name is known
+- ``ADDOP_NAME()``
+    just like ADDOP_O, but name mangling is also handled; used for
+    attribute loading or importing based on name
+- ``ADDOP_JABS()``
+    create an absolute jump to a basic block
+- ``ADDOP_JREL()``
+    create a relative jump to a basic block
+
+Several helper functions that will emit bytecode and are named
+compiler_xx() where xx is what the function helps with (list, boolop,
+etc.).  A rather useful one is compiler_nameop().
+This function looks up the scope of a variable and, based on the
+expression context, emits the proper opcode to load, store, or delete
+the variable.
+
+As for handling the line number on which a statement is defined, is
+handled by compiler_visit_stmt() and thus is not a worry.
+
+In addition to emitting bytecode based on the AST node, handling the
+creation of basic blocks must be done.  Below are the macros and
+functions used for managing basic blocks:
+
+- ``NEW_BLOCK()``
+    create block and set it as current
+- ``NEXT_BLOCK()``
+    basically NEW_BLOCK() plus jump from current block
+- ``compiler_new_block()``
+    create a block but don't use it (used for generating jumps)
+
+Once the CFG is created, it must be flattened and then final emission of
+bytecode occurs.  Flattening is handled using a post-order depth-first
+search.  Once flattened, jump offsets are backpatched based on the
+flattening and then a PyCodeObject file is created.  All of this is
+handled by calling assemble() .
+
+
+Introducing New Bytecode
+------------------------
+
+Sometimes a new feature requires a new opcode.  But adding new bytecode is
+not as simple as just suddenly introducing new bytecode in the AST ->
+bytecode step of the compiler.  Several pieces of code throughout Python depend
+on having correct information about what bytecode exists.
+
+First, you must choose a name and a unique identifier number.  The official
+list of bytecode can be found in Include/opcode.h .  If the opcode is to take
+an argument, it must be given a unique number greater than that assigned to
+``HAVE_ARGUMENT`` (as found in Include/opcode.h).
+
+Once the name/number pair
+has been chosen and entered in Include/opcode.h, you must also enter it into
+Lib/opcode.py and Doc/library/dis.rst .
+
+With a new bytecode you must also change what is called the magic number for
+.pyc files.  The variable ``MAGIC`` in Python/import.c contains the number.
+Changing this number will lead to all .pyc files with the old MAGIC
+to be recompiled by the interpreter on import.
+
+Finally, you need to introduce the use of the new bytecode.  Altering
+Python/compile.c and Python/ceval.c will be the primary places to change.
+But you will also need to change the 'compiler' package.  The key files
+to do that are Lib/compiler/pyassem.py and Lib/compiler/pycodegen.py .
+
+If you make a change here that can affect the output of bytecode that
+is already in existence and you do not change the magic number constantly, make
+sure to delete your old .py(c|o) files!  Even though you will end up changing
+the magic number if you change the bytecode, while you are debugging your work
+you will be changing the bytecode output without constantly bumping up the
+magic number.  This means you end up with stale .pyc files that will not be
+recreated.  Running
+``find . -name '*.py[co]' -exec rm -f {} ';'`` should delete all .pyc files you
+have, forcing new ones to be created and thus allow you test out your new
+bytecode properly.
+
+
+Code Objects
+------------
+
+The result of ``PyAST_Compile()`` is a PyCodeObject which is defined in
+Include/code.h .  And with that you now have executable Python bytecode!
+
+The code objects (byte code) is executed in Python/ceval.c .  This file
+will also need a new case statement for the new opcode in the big switch
+statement in PyEval_EvalFrameEx().
+
+
+Important Files
+---------------
+
++ Parser/
+
+    - Python.asdl
+        ASDL syntax file
+
+    - asdl.py
+        "An implementation of the Zephyr Abstract Syntax Definition
+        Language."  Uses SPARK_ to parse the ASDL files.
+
+    - asdl_c.py
+        "Generate C code from an ASDL description."  Generates
+	Python/Python-ast.c and Include/Python-ast.h .
+
+    - spark.py
+        SPARK_ parser generator
+
++ Python/
+
+    - Python-ast.c
+        Creates C structs corresponding to the ASDL types.  Also
+	contains code for marshaling AST nodes (core ASDL types have
+	marshaling code in asdl.c).  "File automatically generated by
+	Parser/asdl_c.py".  This file must be committed separately
+        after every grammar change is committed since the __version__
+        value is set to the latest grammar change revision number.
+
+    - asdl.c
+        Contains code to handle the ASDL sequence type.  Also has code
+        to handle marshalling the core ASDL types, such as number and
+        identifier.  used by Python-ast.c for marshaling AST nodes.
+
+    - ast.c
+        Converts Python's parse tree into the abstract syntax tree.
+
+    - ceval.c
+        Executes byte code (aka, eval loop).
+
+    - compile.c
+        Emits bytecode based on the AST.
+
+    - symtable.c
+	Generates a symbol table from AST.
+
+    - pyarena.c
+        Implementation of the arena memory manager.
+
+    - import.c
+        Home of the magic number (named ``MAGIC``) for bytecode versioning
+	
+
++ Include/
+
+    - Python-ast.h
+        Contains the actual definitions of the C structs as generated by
+        Python/Python-ast.c .
+        "Automatically generated by Parser/asdl_c.py".
+
+    - asdl.h
+        Header for the corresponding Python/ast.c .
+
+    - ast.h
+        Declares PyAST_FromNode() external (from Python/ast.c).
+
+    - code.h
+	Header file for Objects/codeobject.c; contains definition of
+	PyCodeObject.
+
+    - symtable.h
+	Header for Python/symtable.c .  struct symtable and
+	PySTEntryObject are defined here.
+
+    - pyarena.h
+        Header file for the corresponding Python/pyarena.c .
+
+    - opcode.h
+        Master list of bytecode; if this file is modified you must modify
+        several other files accordingly (see "`Introducing New Bytecode`_")
+
++ Objects/
+
+    - codeobject.c
+	Contains PyCodeObject-related code (originally in
+	Python/compile.c).
+
++ Lib/
+
+    - opcode.py
+        One of the files that must be modified if Include/opcode.h is.
+
+    - compiler/
+
+        * pyassem.py
+            One of the files that must be modified if Include/opcode.h is
+            changed.
+
+        * pycodegen.py
+            One of the files that must be modified if Include/opcode.h is
+            changed.
+
+
+Known Compiler-related Experiments
+----------------------------------
+
+This section lists known experiments involving the compiler (including
+bytecode).
+
+Skip Montanaro presented a paper at a Python workshop on a peephole optimizer
+[#skip-peephole]_.
+
+Michael Hudson has a non-active SourceForge project named Bytecodehacks
+[#Bytecodehacks]_ that provides functionality for playing with bytecode
+directly.
+
+An opcode to combine the functionality of LOAD_ATTR/CALL_FUNCTION was created
+named CALL_ATTR [#CALL_ATTR]_.  Currently only works for classic classes and
+for new-style classes rough benchmarking showed an actual slowdown thanks to
+having to support both classic and new-style classes.
+
+
+
+References
+----------
+
+.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman.
+   `Compilers: Principles, Techniques, and Tools`,
+   http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108
+
+.. [Wang97]  Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris
+   S. Serra.  `The Zephyr Abstract Syntax Description Language.`_
+   In Proceedings of the Conference on Domain-Specific Languages, pp.
+   213--227, 1997.
+
+.. _The Zephyr Abstract Syntax Description Language.:
+   http://www.cs.princeton.edu/~danwang/Papers/dsl97/dsl97.html
+
+.. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/
+
+.. [#skip-peephole] Skip Montanaro's Peephole Optimizer Paper
+   (http://www.foretec.com/python/workshops/1998-11/proceedings/papers/montanaro/montanaro.html)
+
+.. [#Bytecodehacks] Bytecodehacks Project
+   (http://bytecodehacks.sourceforge.net/bch-docs/bch/index.html)
+
+.. [#CALL_ATTR] CALL_ATTR opcode
+   (http://www.python.org/sf/709744)
diff --git a/index.rst b/index.rst
--- a/index.rst
+++ b/index.rst
@@ -22,6 +22,7 @@
    langchanges
 
    grammar
+   compiler
 
 
 .. todolist::
@@ -78,6 +79,7 @@
     * `Daily snapshot <http://svn.python.org/snapshots/>`_
 * Help with ...
     * :ref:`grammar`
+    * :ref:`compiler`
 
 
 .. _buildbots: http://python.org/dev/buildbot/

--
Repository URL: http://hg.python.org/devguide