[pypy-commit] pypy release-2.5.x: merge default into branch

mattip noreply at buildbot.pypy.org
Wed Mar 18 11:29:03 CET 2015


Author: mattip <matti.picus at gmail.com>
Branch: release-2.5.x
Changeset: r76453:e3d046c43451
Date: 2015-03-18 12:28 +0200
http://bitbucket.org/pypy/pypy/changeset/e3d046c43451/

Log:	merge default into branch

diff --git a/.hgtags b/.hgtags
--- a/.hgtags
+++ b/.hgtags
@@ -11,3 +11,4 @@
 32f35069a16d819b58c1b6efb17c44e3e53397b2 release-2.3.1
 32f35069a16d819b58c1b6efb17c44e3e53397b2 release-2.3.1
 10f1b29a2bd21f837090286174a9ca030b8680b2 release-2.5.0
+8e24dac0b8e2db30d46d59f2c4daa3d4aaab7861 release-2.5.1
diff --git a/pypy/doc/stm.rst b/pypy/doc/stm.rst
--- a/pypy/doc/stm.rst
+++ b/pypy/doc/stm.rst
@@ -46,13 +46,14 @@
   multiple cores.
 
 * ``pypy-stm`` provides (but does not impose) a special API to the
-  user in the pure Python module `transaction`_.  This module is based
-  on the lower-level module `pypystm`_, but also provides some
+  user in the pure Python module ``transaction``.  This module is based
+  on the lower-level module ``pypystm``, but also provides some
+  compatibility with non-STM PyPy or CPython.
 
 * Building on top of the way the GIL is removed, we will talk
-  about `Atomic sections, Transactions, etc.: a better way to write
-  parallel programs`_.
+  about `How to write multithreaded programs: the 10'000-feet view`_
+  and `transaction.TransactionQueue`_.
+
 
 
 Getting Started
@@ -89,7 +90,7 @@
 Current status (stmgc-c7)
 -------------------------
 
-* It seems to work fine, without crashing any more.  Please `report
+* **NEW:** It seems to work fine, without crashing any more.  Please `report
   any crash`_ you find (or other bugs).
 
 * It runs with an overhead as low as 20% on examples like "richards".
@@ -97,33 +98,47 @@
   2x for "translate.py"-- which we are still trying to understand.
   One suspect is our partial GC implementation, see below.
 
+* **NEW:** the ``PYPYSTM`` environment variable and the
+  ``pypy/stm/print_stm_log.py`` script let you know exactly which
+  "conflicts" occurred.  This is described in the section
+  `transaction.TransactionQueue`_ below.
+
+* **NEW:** special transaction-friendly APIs (like ``stmdict``),
+  described in the section `transaction.TransactionQueue`_ below.  The
+  old API changed again, mostly moving to different modules.  Sorry
+  about that.  I feel it's a better idea to change the API early
+  instead of being stuck with a bad one later...
+
 * Currently limited to 1.5 GB of RAM (this is just a parameter in
   `core.h`__ -- theoretically.  In practice, increase it too much and
   clang crashes again).  Memory overflows are not correctly handled;
   they cause segfaults.
 
-* The JIT warm-up time improved recently but is still bad.  In order to
-  produce machine code, the JIT needs to enter a special single-threaded
-  mode for now.  This means that you will get bad performance results if
-  your program doesn't run for several seconds, where *several* can mean
-  *many.*  When trying benchmarks, be sure to check that you have
-  reached the warmed state, i.e. the performance is not improving any
-  more.  This should be clear from the fact that as long as it's
-  producing more machine code, ``pypy-stm`` will run on a single core.
+* **NEW:** The JIT warm-up time improved again, but is still
+  relatively large.  In order to produce machine code, the JIT needs
+  to enter "inevitable" mode.  This means that you will get bad
+  performance results if your program doesn't run for several seconds,
+  where *several* can mean *many.* When trying benchmarks, be sure to
+  check that you have reached the warmed state, i.e. the performance
+  is not improving any more.
 
 * The GC is new; although clearly inspired by PyPy's regular GC, it
   misses a number of optimizations for now.  Programs allocating large
   numbers of small objects that don't immediately die (surely a common
-  situation) suffer from these missing optimizations.
+  situation) suffer from these missing optimizations.  (The bleeding
+  edge ``stmgc-c8`` is better at that.)
 
 * Weakrefs might appear to work a bit strangely for now, sometimes
   staying alive through ``gc.collect()``, or even dying but then
-  un-dying for a short time before dying again.
+  un-dying for a short time before dying again.  A similar problem can
+  show up occasionally elsewhere with accesses to some external
+  resources, where the (apparent) serialized order doesn't match the
+  underlying (multithreading) order.  These are bugs (partially fixed
+  already in ``stmgc-c8``).
 
 * The STM system is based on very efficient read/write barriers, which
   are mostly done (their placement could be improved a bit in
-  JIT-generated machine code).  But the overall bookkeeping logic could
-  see more improvements (see `Low-level statistics`_ below).
+  JIT-generated machine code).
 
 * Forking the process is slow because the complete memory needs to be
   copied manually.  A warning is printed to this effect.
@@ -132,7 +147,8 @@
   crash on an assertion error because of a non-implemented overflow of
   an internal 28-bit counter.
 
-.. _`report bugs`: https://bugs.pypy.org/
+
+.. _`report any crash`: https://bitbucket.org/pypy/pypy/issues?status=new&status=open
 .. __: https://bitbucket.org/pypy/pypy/raw/stmgc-c7/rpython/translator/stm/src_stm/stm/core.h
 
 
@@ -155,7 +171,6 @@
 interpreter and other ones might have slightly different needs.
 
 
-
 User Guide
 ==========
 
@@ -181,8 +196,33 @@
 order.
 
 
-A better way to write parallel programs
----------------------------------------
+How to write multithreaded programs: the 10'000-feet view
+---------------------------------------------------------
+
+PyPy-STM offers two ways to write multithreaded programs:
+
+* the traditional way, using the ``thread`` or ``threading`` modules.
+
+* using ``TransactionQueue``, described next__, as a way to hide the
+  low-level notion of threads.
+
+.. __: `transaction.TransactionQueue`_
+
+``TransactionQueue`` hides the hard multithreading-related issues that
+we typically encounter when using low-level threads.  This is not the
+first alternative approach to avoid dealing with low-level threads;
+for example, OpenMP_ is one.  However, it is one of the first ones
+which does not require the code to be organized in a particular
+fashion.  Instead, it works on any Python program which has got
+*latent* and *imperfect* parallelism.  Ideally, it only requires that
+the end programmer identifies where this parallelism is likely to be
+found, and communicates it to the system using a simple API.
+
+.. _OpenMP: http://en.wikipedia.org/wiki/OpenMP
+
+
+transaction.TransactionQueue
+----------------------------
 
 In CPU-hungry programs, we can often easily identify outermost loops
 over some data structure, or other repetitive algorithm, where each
@@ -216,41 +256,115 @@
 behavior did not change because we are using ``TransactionQueue``.
 All the calls still *appear* to execute in some serial order.
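+
+To recap, the essential pattern is the following minimal sketch
+(``f`` and ``bigdict`` are placeholders for the user's own function
+and data, not part of the API)::
+
+    from transaction import TransactionQueue
+
+    tq = TransactionQueue()
+    for key, value in bigdict.items():
+        tq.add(f, key, value)   # each call becomes one transaction
+    tq.run()                    # runs them, in parallel when possible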
 
-Now the performance should ideally be improved: if the function calls
-turn out to be actually independent (most of the time), then it will
-be.  But if the function calls are not, then the total performance
-will crawl back to the previous case, with additionally some small
-penalty for the overhead.
+However, the performance typically does not increase out of the box.
+In fact, it is likely to be worse at first.  This is usually
+indicated by the total CPU usage, which remains low (closer to 1 than
+N cores).  First note that it is expected that the CPU usage should
+not go much higher than 1 in the JIT warm-up phase: you must run a
+program for several seconds, or for larger programs at least one
+minute, to give the JIT a chance to warm up enough.  But if CPU usage
+remains low even afterwards, then the ``PYPYSTM`` environment variable
+can be used to track what is going on.
 
-This case occurs typically when you see the total CPU usage remaining
-low (closer to 1 than N cores).  Note first that it is expected that
-the CPU usage should not go much higher than 1 in the JIT warm-up
-phase.  You must run a program for several seconds, or for larger
-programs at least one minute, to give the JIT a chance to warm up
-correctly.  But if CPU usage remains low even though all code is
-executing in a ``TransactionQueue.run()``, then the ``PYPYSTM``
-environment variable can be used to track what is going on.
+Run your program with ``PYPYSTM=logfile`` to produce a log file called
+``logfile``.  Afterwards, use the ``pypy/stm/print_stm_log.py``
+utility to inspect the content of this log file.  It produces output
+like this (sorted by amount of time lost, largest first)::
 
-Run your program with ``PYPYSTM=stmlog`` to produce a log file called
-``stmlog``.  Afterwards, use the ``pypy/stm/print_stm_log.py`` utility
-to inspect the content of this log file.  It produces output like
-this::
+    10.5s lost in aborts, 1.25s paused (12412x STM_CONTENTION_WRITE_WRITE)
+    File "foo.py", line 10, in f
+      someobj.stuff = 5
+    File "bar.py", line 20, in g
+      someobj.other = 10
 
-    documentation in progress!
+This means that 10.5 seconds were lost running transactions that were
+aborted (which caused another 1.25 seconds of lost time by pausing),
+because of the reason shown in the two independent single-entry
+tracebacks: one thread ran the line ``someobj.stuff = 5``, whereas
+another thread concurrently ran the line ``someobj.other = 10`` on the
+same object.  Two writes to the same object cause a conflict, which
+aborts one of the two transactions.  In the example above this
+occurred 12412 times.
 
+The two other conflict sources are ``STM_CONTENTION_INEVITABLE``,
+which means that two transactions both tried to do an external
+operation, like printing or reading from a socket or accessing an
+external array of raw data; and ``STM_CONTENTION_WRITE_READ``, which
+means that one transaction wrote to an object while another one merely
+read it, without writing (in that case only the writing transaction is
+reported; the location for the reads is not recorded because doing so
+is not possible without a very large performance impact).
+
+Common causes of conflicts:
+
+* First of all, any I/O or raw manipulation of memory turns the
+  transaction inevitable ("must not abort").  There can be only one
+  inevitable transaction running at any time.  A common case is when
+  each transaction starts by sending data to a log file.  You should
+  refactor this case so that it occurs either near the end of the
+  transaction (which can then mostly run in non-inevitable mode), or
+  even delegate it to a separate thread.
+
+* Writing to a list or a dictionary conflicts with any read from the
+  same list or dictionary, even one done with a different key.  For
+  dictionaries and sets, you can try the types ``transaction.stmdict``
+  and ``transaction.stmset``, which behave mostly like ``dict`` and
+  ``set`` but allow concurrent access to different keys.  (What is
+  missing from them so far is lazy iteration: for example,
+  ``stmdict.iterkeys()`` is implemented as ``iter(stmdict.keys())``;
+  and, unlike PyPy's dictionaries and sets, the STM versions are not
+  ordered.)  There are also experimental ``stmiddict`` and
+  ``stmidset`` classes using the identity of the key.  (A usage
+  sketch follows after this list.)
+
+* ``time.time()`` and ``time.clock()`` turn the transaction inevitable
+  in order to guarantee that a call that appears to be later will
+  really return a higher number.  If getting slightly unordered
+  results is fine, use ``transaction.time()`` or
+  ``transaction.clock()``.
+
+* ``transaction.threadlocalproperty`` can be used at class level::
+
+      class Foo(object):     # must be a new-style class!
+          x = transaction.threadlocalproperty()
+          y = transaction.threadlocalproperty(dict)
+
+  This declares that instances of ``Foo`` have two attributes ``x``
+  and ``y`` that are thread-local: reading or writing them from
+  concurrently-running transactions will return independent results.
+  (Any other attributes of ``Foo`` instances will be globally visible
+  from all threads, as usual.)  The optional argument to
+  ``threadlocalproperty()`` is the default value factory: in case no
+  value was assigned in the current thread yet, the factory is called
+  and its result becomes the value in that thread (like
+  ``collections.defaultdict``).  If no default value factory is
+  specified, uninitialized reads raise ``AttributeError``.  Note that
+  with ``TransactionQueue`` you get a pool of a fixed number of
+  threads, each running the transactions one after the other; such
+  thread-local properties will have the value last stored in them in
+  the same thread, which may come from a random previous transaction.
+  ``threadlocalproperty`` is still useful to avoid conflicts from
+  cache-like data structures.
+
+Note that Python is a complicated language; there are a number of less
+common cases that may cause conflict (of any type) where we might not
+expect it a priori.  In many of these cases it could be fixed; please
+report any case that you don't understand.  (For example, so far,
+creating a weakref to an object requires attaching an auxiliary
+internal object to that object, and so it can cause write-write
+conflicts.)
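+
+To illustrate the ``stmdict`` point from the list above, here is a
+minimal sketch of a shared cache (``expensive()`` is a placeholder
+for the user's own computation)::
+
+    import transaction
+
+    cache = transaction.stmdict()   # concurrent access to *different*
+                                    # keys does not conflict
+
+    def cached(key):
+        try:
+            return cache[key]
+        except KeyError:
+            result = expensive(key)
+            cache[key] = result
+            return result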
 
 
 Atomic sections
 ---------------
 
-PyPy supports *atomic sections,* which are blocks of code which you
-want to execute without "releasing the GIL".  In STM terms, this means
-blocks of code that are executed while guaranteeing that the
-transaction is not interrupted in the middle.  *This is experimental
-and may be removed in the future* if `lock elision`_ is ever
-implemented.
+The ``TransactionQueue`` class described above is based on *atomic
+sections,* which are blocks of code which you want to execute without
+"releasing the GIL".  In STM terms, this means blocks of code that are
+executed while guaranteeing that the transaction is not interrupted in
+the middle.  *This is experimental and may be removed in the future*
+if `Software lock elision`_ is ever implemented.
 
-Here is a usage example::
+Here is a direct usage example::
 
     with transaction.atomic:
         assert len(lst1) == 10
@@ -295,7 +409,8 @@
 including with a ``print`` to standard output.  If one thread tries to
 acquire a lock while running in an atomic block, and another thread
 has got the same lock at that point, then the former may fail with a
-``thread.error``.  The reason is that "waiting" for some condition to
+``thread.error``.  (Don't rely on it; it may also deadlock.)
+The reason is that "waiting" for some condition to
 become true --while running in an atomic block-- does not really make
 sense.  For now you can work around it by making sure that, say, all
 your prints are either in an ``atomic`` block or none of them are.
@@ -354,106 +469,38 @@
 .. _`software lock elision`: https://www.repository.cam.ac.uk/handle/1810/239410
 
 
-Atomic sections, Transactions, etc.: a better way to write parallel programs
-----------------------------------------------------------------------------
+Miscellaneous functions
+-----------------------
 
-(This section is based on locks as we plan to implement them, but also
-works with the existing atomic sections.)
-
-In the cases where elision works, the block of code can run in parallel
-with other blocks of code *even if they are protected by the same lock.*
-You still get the illusion that the blocks are run sequentially.  This
-works even for multiple threads that run each a series of such blocks
-and nothing else, protected by one single global lock.  This is
-basically the Python application-level equivalent of what was done with
-the interpreter in ``pypy-stm``: while you think you are writing
-thread-unfriendly code because of this global lock, actually the
-underlying system is able to make it run on multiple cores anyway.
-
-This capability can be hidden in a library or in the framework you use;
-the end user's code does not need to be explicitly aware of using
-threads.  For a simple example of this, there is `transaction.py`_ in
-``lib_pypy``.  The idea is that you write, or already have, some program
-where the function ``f(key, value)`` runs on every item of some big
-dictionary, say::
-
-    for key, value in bigdict.items():
-        f(key, value)
-
-Then you simply replace the loop with::
-
-    for key, value in bigdict.items():
-        transaction.add(f, key, value)
-    transaction.run()
-
-This code runs the various calls to ``f(key, value)`` using a thread
-pool, but every single call is executed under the protection of a unique
-lock.  The end result is that the behavior is exactly equivalent --- in
-fact it makes little sense to do it in this way on a non-STM PyPy or on
-CPython.  But on ``pypy-stm``, the various locked calls to ``f(key,
-value)`` can tentatively be executed in parallel, even if the observable
-result is as if they were executed in some serial order.
-
-This approach hides the notion of threads from the end programmer,
-including all the hard multithreading-related issues.  This is not the
-first alternative approach to explicit threads; for example, OpenMP_ is
-one.  However, it is one of the first ones which does not require the
-code to be organized in a particular fashion.  Instead, it works on any
-Python program which has got latent, imperfect parallelism.  Ideally, it
-only requires that the end programmer identifies where this parallelism
-is likely to be found, and communicates it to the system, using for
-example the ``transaction.add()`` scheme.
-
-.. _`transaction.py`: https://bitbucket.org/pypy/pypy/raw/stmgc-c7/lib_pypy/transaction.py
-.. _OpenMP: http://en.wikipedia.org/wiki/OpenMP
-
-
-.. _`transactional_memory`:
-
-API of transactional_memory
----------------------------
-
-The new pure Python module ``transactional_memory`` runs on both CPython
-and PyPy, both with and without STM.  It contains:
-
-* ``getsegmentlimit()``: return the number of "segments" in
+* ``transaction.getsegmentlimit()``: return the number of "segments" in
   this pypy-stm.  This is the limit above which more threads will not be
   able to execute on more cores.  (Right now it is limited to 4 due to
   inter-segment overhead, but should be increased in the future.  It
   should also be settable, and the default value should depend on the
   number of actual CPUs.)  If STM is not available, this returns 1.
 
-* ``print_abort_info(minimum_time=0.0)``: debugging help.  Each thread
-  remembers the longest abort or pause it did because of cross-thread
-  contention_.  This function prints it to ``stderr`` if the time lost
-  is greater than ``minimum_time`` seconds.  The record is then
-  cleared, to make it ready for new events.  This function returns
-  ``True`` if it printed a report, and ``False`` otherwise.
+* ``__pypy__.thread.signals_enabled``: a context manager that runs its
+  block of code with signals enabled.  By default, signals are only
+  enabled in the main thread; a non-main thread will not receive
+  signals (this is like CPython).  Enabling signals in non-main
+  threads is useful for libraries where threads are hidden and the end
+  user is not expecting his code to run elsewhere than in the main
+  thread.
 
+* ``pypystm.exclusive_atomic``: a context manager similar to
+  ``transaction.atomic`` but which complains if it is nested.
 
-API of __pypy__.thread
-----------------------
+* ``transaction.is_atomic()``: return True if called from an atomic
+  context.
 
-The ``__pypy__.thread`` submodule is a built-in module of PyPy that
-contains a few internal built-in functions used by the
-``transactional_memory`` module, plus the following:
+* ``pypystm.count()``: return a different positive integer every time
+  it is called.  This works without generating conflicts.  The
+  returned integers are only roughly in increasing order; this should
+  not be relied upon.
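+
+A small sketch of how these functions fit together (assuming a
+pypy-stm build; the printed number is illustrative)::
+
+    import transaction, pypystm
+
+    print transaction.getsegmentlimit()    # e.g. 4 on current pypy-stm
+    with pypystm.exclusive_atomic:
+        assert transaction.is_atomic()
+        n = pypystm.count()    # unique, roughly-increasing integer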
 
-* ``__pypy__.thread.atomic``: a context manager to run a block in
-  fully atomic mode, without "releasing the GIL".  (May be eventually
-  removed?)
 
-* ``__pypy__.thread.signals_enabled``: a context manager that runs its
-  block with signals enabled.  By default, signals are only enabled in
-  the main thread; a non-main thread will not receive signals (this is
-  like CPython).  Enabling signals in non-main threads is useful for
-  libraries where threads are hidden and the end user is not expecting
-  his code to run elsewhere than in the main thread.
-
-
-.. _contention:
-
-Conflicts
----------
+More details about conflicts
+----------------------------
 
 Based on Software Transactional Memory, the ``pypy-stm`` solution is
 prone to "conflicts".  To repeat the basic idea, threads execute their code
@@ -469,7 +516,7 @@
 the transaction).  If this occurs too often, parallelization fails.
 
 How much actual parallelization a multithreaded program can see is a bit
-subtle.  Basically, a program not using ``__pypy__.thread.atomic`` or
+subtle.  Basically, a program not using ``transaction.atomic`` or
 eliding locks, or doing so for very short amounts of time, will
 parallelize almost freely (as long as it's not some artificial example
 where, say, all threads try to increase the same global counter and do
@@ -481,13 +528,14 @@
 overview.
 
 Parallelization works as long as two principles are respected.  The
-first one is that the transactions must not *conflict* with each other.
-The most obvious sources of conflicts are threads that all increment a
-global shared counter, or that all store the result of their
-computations into the same list --- or, more subtly, that all ``pop()``
-the work to do from the same list, because that is also a mutation of
-the list.  (It is expected that some STM-aware library will eventually
-be designed to help with conflict problems, like a STM-aware queue.)
+first one is that the transactions must not *conflict* with each
+other.  The most obvious sources of conflicts are threads that all
+increment a global shared counter, or that all store the result of
+their computations into the same list --- or, more subtly, that all
+``pop()`` the work to do from the same list, because that is also a
+mutation of the list.  (You can work around it with
+``transaction.stmdict``, but for that specific example, some STM-aware
+queue should eventually be designed.)
 
 A conflict occurs as follows: when a transaction commits (i.e. finishes
 successfully) it may cause other transactions that are still in progress
@@ -503,22 +551,23 @@
 Another issue is that of avoiding long-running so-called "inevitable"
 transactions ("inevitable" is taken in the sense of "which cannot be
 avoided", i.e. transactions which cannot abort any more).  Transactions
-like that should only occur if you use ``__pypy__.thread.atomic``,
-generally become of I/O in atomic blocks.  They work, but the
+like that should only occur if you use ``atomic``,
+generally because of I/O in atomic blocks.  They work, but the
 transaction is turned inevitable before the I/O is performed.  For all
 the remaining execution time of the atomic block, they will impede
 parallel work.  The best is to organize the code so that such operations
-are done completely outside ``__pypy__.thread.atomic``.
+are done completely outside ``atomic``.
 
-(This is related to the fact that blocking I/O operations are
+(This is not unrelated to the fact that blocking I/O operations are
 discouraged with Twisted, and if you really need them, you should do
 them on their own separate thread.)
 
-In case of lock elision, we don't get long-running inevitable
-transactions, but a different problem can occur: doing I/O cancels lock
-elision, and the lock turns into a real lock, preventing other threads
-from committing if they also need this lock.  (More about it when lock
-elision is implemented and tested.)
+In case lock elision eventually replaces atomic sections, we wouldn't
+get long-running inevitable transactions, but the same problem occurs
+in a different way: doing I/O cancels lock elision, and the lock turns
+into a real lock.  This prevents other threads from committing if they
+also need this lock.  (More about it when lock elision is implemented
+and tested.)
 
 
 
@@ -528,56 +577,18 @@
 XXX this section mostly empty for now
 
 
-Low-level statistics
---------------------
-
-When a non-main thread finishes, you get low-level statistics printed to
-stderr, looking like that::
-
-      thread 0x7f73377fe600:
-          outside transaction          42182    0.506 s
-          run current                  85466    0.000 s
-          run committed                34262    3.178 s
-          run aborted write write       6982    0.083 s
-          run aborted write read         550    0.005 s
-          run aborted inevitable         388    0.010 s
-          run aborted other                0    0.000 s
-          wait free segment                0    0.000 s
-          wait write read                 78    0.027 s
-          wait inevitable                887    0.490 s
-          wait other                       0    0.000 s
-          sync commit soon                 1    0.000 s
-          bookkeeping                  51418    0.606 s
-          minor gc                    162970    1.135 s
-          major gc                         1    0.019 s
-          sync pause                   59173    1.738 s
-          longest recordered marker          0.000826 s
-          "File "x.py", line 5, in f"
-
-On each line, the first number is a counter, and the second number gives
-the associated time --- the amount of real time that the thread was in
-this state.  The sum of all the times should be equal to the total time
-between the thread's start and the thread's end.  The most important
-points are "run committed", which gives the amount of useful work, and
-"outside transaction", which should give the time spent e.g. in library
-calls (right now it seems to be larger than that; to investigate).  The
-various "run aborted" and "wait" entries are time lost due to
-conflicts_.  Everything else is overhead of various forms.  (Short-,
-medium- and long-term future work involves reducing this overhead :-)
-
-The last two lines are special; they are an internal marker read by
-``transactional_memory.print_abort_info()``.
-
-
 Reference to implementation details
 -----------------------------------
 
-The core of the implementation is in a separate C library called stmgc_,
-in the c7_ subdirectory.  Please see the `README.txt`_ for more
-information.  In particular, the notion of segment is discussed there.
+The core of the implementation is in a separate C library called
+stmgc_, in the c7_ subdirectory (current version of pypy-stm) and in
+the c8_ subdirectory (bleeding edge version).  Please see the
+`README.txt`_ for more information.  In particular, the notion of
+segment is discussed there.
 
 .. _stmgc: https://bitbucket.org/pypy/stmgc/src/default/
 .. _c7: https://bitbucket.org/pypy/stmgc/src/default/c7/
+.. _c8: https://bitbucket.org/pypy/stmgc/src/default/c8/
 .. _`README.txt`: https://bitbucket.org/pypy/stmgc/raw/default/c7/README.txt
 
 PyPy itself adds on top of it the automatic placement of read__ and write__
diff --git a/pypy/goal/getnightly.py b/pypy/goal/getnightly.py
--- a/pypy/goal/getnightly.py
+++ b/pypy/goal/getnightly.py
@@ -7,7 +7,7 @@
 if sys.platform.startswith('linux'):
     arch = 'linux'
     cmd = 'wget "%s"'
-    tar = "tar -x -v --wildcards --strip-components=2 -f %s '*/bin/pypy'"
+    tar = "tar -x -v --wildcards --strip-components=2 -f %s '*/bin/pypy' '*/bin/libpypy-c.so'"
     if os.uname()[-1].startswith('arm'):
         arch += '-armhf-raspbian'
 elif sys.platform.startswith('darwin'):
diff --git a/pypy/interpreter/pycode.py b/pypy/interpreter/pycode.py
--- a/pypy/interpreter/pycode.py
+++ b/pypy/interpreter/pycode.py
@@ -136,7 +136,9 @@
             filename = filename[:-1]
         basename = os.path.basename(filename)
         lastdirname = os.path.basename(os.path.dirname(filename))
-        self.co_filename = '<builtin>/%s/%s' % (lastdirname, basename)
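+        # when the module is at the top level, lastdirname is empty;
+        # avoid producing a bogus path like '<builtin>//foo.py'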
+        if lastdirname:
+            basename = '%s/%s' % (lastdirname, basename)
+        self.co_filename = '<builtin>/%s' % (basename,)
 
     co_names = property(lambda self: [self.space.unwrap(w_name) for w_name in self.co_names_w]) # for trace
 
diff --git a/pypy/interpreter/pyopcode.py b/pypy/interpreter/pyopcode.py
--- a/pypy/interpreter/pyopcode.py
+++ b/pypy/interpreter/pyopcode.py
@@ -1619,6 +1619,13 @@
     def prepare_exec(f, prog, globals, locals, compile_flags, builtin, codetype):
         """Manipulate parameters to exec statement to (codeobject, dict, dict).
         """
+        if (globals is None and locals is None and
+            isinstance(prog, tuple) and
+            (len(prog) == 2 or len(prog) == 3)):
+            globals = prog[1]
+            if len(prog) == 3:
+                locals = prog[2]
+            prog = prog[0]
         if globals is None:
             globals = f.f_globals
             if locals is None:
diff --git a/pypy/interpreter/test/test_exec.py b/pypy/interpreter/test/test_exec.py
--- a/pypy/interpreter/test/test_exec.py
+++ b/pypy/interpreter/test/test_exec.py
@@ -262,3 +262,11 @@
         """]
         for c in code:
             compile(c, "<code>", "exec")
+
+    def test_exec_tuple(self):
+        # note: this is VERY different than testing exec("a = 42", d), because
+        # this specific case is handled specially by the AST compiler
+        d = {}
+        x = ("a = 42", d)
+        exec x
+        assert d['a'] == 42
diff --git a/pypy/module/_socket/interp_socket.py b/pypy/module/_socket/interp_socket.py
--- a/pypy/module/_socket/interp_socket.py
+++ b/pypy/module/_socket/interp_socket.py
@@ -30,7 +30,7 @@
                                space.wrap(addr.get_protocol()),
                                space.wrap(addr.get_pkttype()),
                                space.wrap(addr.get_hatype()),
-                               space.wrap(addr.get_addr())])
+                               space.wrap(addr.get_haddr())])
     elif rsocket.HAS_AF_UNIX and isinstance(addr, rsocket.UNIXAddress):
         return space.wrap(addr.get_path())
     elif rsocket.HAS_AF_NETLINK and isinstance(addr, rsocket.NETLINKAddress):
@@ -79,7 +79,7 @@
         raise NotImplementedError
 
 # XXX Hack to separate rpython and pypy
-def addr_from_object(family, space, w_address):
+def addr_from_object(family, fd, space, w_address):
     if family == rsocket.AF_INET:
         w_host, w_port = space.unpackiterable(w_address, 2)
         host = space.str_w(w_host)
@@ -89,8 +89,9 @@
     if family == rsocket.AF_INET6:
         pieces_w = space.unpackiterable(w_address)
         if not (2 <= len(pieces_w) <= 4):
-            raise TypeError("AF_INET6 address must be a tuple of length 2 "
-                               "to 4, not %d" % len(pieces_w))
+            raise oefmt(space.w_TypeError,
+                        "AF_INET6 address must be a tuple of length 2 "
+                        "to 4, not %d", len(pieces_w))
         host = space.str_w(pieces_w[0])
         port = space.int_w(pieces_w[1])
         port = make_ushort_port(space, port)
@@ -105,6 +106,28 @@
     if rsocket.HAS_AF_NETLINK and family == rsocket.AF_NETLINK:
         w_pid, w_groups = space.unpackiterable(w_address, 2)
         return rsocket.NETLINKAddress(space.uint_w(w_pid), space.uint_w(w_groups))
+    if rsocket.HAS_AF_PACKET and family == rsocket.AF_PACKET:
+        pieces_w = space.unpackiterable(w_address)
+        if not (2 <= len(pieces_w) <= 5):
+            raise oefmt(space.w_TypeError,
+                        "AF_PACKET address must be a tuple of length 2 "
+                        "to 5, not %d", len(pieces_w))
+        ifname = space.str_w(pieces_w[0])
+        ifindex = rsocket.PacketAddress.get_ifindex_from_ifname(fd, ifname)
+        protocol = space.int_w(pieces_w[1])
+        if len(pieces_w) > 2: pkttype = space.int_w(pieces_w[2])
+        else:                 pkttype = 0
+        if len(pieces_w) > 3: hatype = space.int_w(pieces_w[3])
+        else:                 hatype = 0
+        if len(pieces_w) > 4: haddr = space.str_w(pieces_w[4])
+        else:                 haddr = ""
+        if len(haddr) > 8:
+            raise OperationError(space.w_ValueError, space.wrap(
+                "Hardware address must be 8 bytes or less"))
+        if protocol < 0 or protocol > 0xffff:
+            raise OperationError(space.w_OverflowError, space.wrap(
+                "protoNumber must be 0-65535."))
+        return rsocket.PacketAddress(ifindex, protocol, pkttype, hatype, haddr)
     raise RSocketError("unknown address family")
 
 # XXX Hack to separate rpython and pypy
@@ -172,7 +195,8 @@
     # convert an app-level object into an Address
     # based on the current socket's family
     def addr_from_object(self, space, w_address):
-        return addr_from_object(self.sock.family, space, w_address)
+        fd = intmask(self.sock.fd)
+        return addr_from_object(self.sock.family, fd, space, w_address)
 
     def bind_w(self, space, w_addr):
         """bind(address)
diff --git a/pypy/module/_socket/test/test_sock_app.py b/pypy/module/_socket/test/test_sock_app.py
--- a/pypy/module/_socket/test/test_sock_app.py
+++ b/pypy/module/_socket/test/test_sock_app.py
@@ -1,4 +1,4 @@
-import sys
+import sys, os
 import py
 from pypy.tool.pytest.objspace import gettestobjspace
 from rpython.tool.udir import udir
@@ -615,6 +615,28 @@
             os.chdir(oldcwd)
 
 
+class AppTestPacket:
+    def setup_class(cls):
+        if not hasattr(os, 'getuid') or os.getuid() != 0:
+            py.test.skip("AF_PACKET needs to be root for testing")
+        w_ok = space.appexec([], "(): import _socket; " +
+                                 "return hasattr(_socket, 'AF_PACKET')")
+        if not space.is_true(w_ok):
+            py.test.skip("no AF_PACKET on this platform")
+        cls.space = space
+
+    def test_convert_between_tuple_and_sockaddr_ll(self):
+        import _socket
+        s = _socket.socket(_socket.AF_PACKET, _socket.SOCK_RAW)
+        assert s.getsockname() == ('', 0, 0, 0, '')
+        s.bind(('lo', 123))
+        a, b, c, d, e = s.getsockname()
+        assert (a, b, c) == ('lo', 123, 0)
+        assert isinstance(d, int)
+        assert isinstance(e, str)
+        assert 0 <= len(e) <= 8
+
+
 class AppTestSocketTCP:
     HOST = 'localhost'
 
diff --git a/pypy/module/zipimport/test/test_zipimport_deflated.py b/pypy/module/zipimport/test/test_zipimport_deflated.py
--- a/pypy/module/zipimport/test/test_zipimport_deflated.py
+++ b/pypy/module/zipimport/test/test_zipimport_deflated.py
@@ -14,7 +14,7 @@
     def setup_class(cls):
         try:
             import rpython.rlib.rzlib
-        except ImportError:
+        except CompilationError:
             py.test.skip("zlib not available, cannot test compressed zipfiles")
         cls.make_class()
         cls.w_BAD_ZIP = cls.space.wrap(BAD_ZIP)
diff --git a/rpython/annotator/binaryop.py b/rpython/annotator/binaryop.py
--- a/rpython/annotator/binaryop.py
+++ b/rpython/annotator/binaryop.py
@@ -132,13 +132,11 @@
         impl = pair(s_c1, s_o2).getitem
         return read_can_only_throw(impl, s_c1, s_o2)
 
-    def getitem_idx_key((s_c1, s_o2)):
+    def getitem_idx((s_c1, s_o2)):
         impl = pair(s_c1, s_o2).getitem
         return impl()
-    getitem_idx_key.can_only_throw = _getitem_can_only_throw
+    getitem_idx.can_only_throw = _getitem_can_only_throw
 
-    getitem_idx = getitem_idx_key
-    getitem_key = getitem_idx_key
 
 
 class __extend__(pairtype(SomeType, SomeType),
@@ -565,14 +563,10 @@
         return lst1.listdef.read_item()
     getitem.can_only_throw = []
 
-    getitem_key = getitem
-
     def getitem_idx((lst1, int2)):
         return lst1.listdef.read_item()
     getitem_idx.can_only_throw = [IndexError]
 
-    getitem_idx_key = getitem_idx
-
     def setitem((lst1, int2), s_value):
         lst1.listdef.mutate()
         lst1.listdef.generalize(s_value)
@@ -588,14 +582,10 @@
         return SomeChar(no_nul=str1.no_nul)
     getitem.can_only_throw = []
 
-    getitem_key = getitem
-
     def getitem_idx((str1, int2)):
         return SomeChar(no_nul=str1.no_nul)
     getitem_idx.can_only_throw = [IndexError]
 
-    getitem_idx_key = getitem_idx
-
     def mul((str1, int2)): # xxx do we want to support this
         return SomeString(no_nul=str1.no_nul)
 
@@ -604,14 +594,10 @@
         return SomeUnicodeCodePoint()
     getitem.can_only_throw = []
 
-    getitem_key = getitem
-
     def getitem_idx((str1, int2)):
         return SomeUnicodeCodePoint()
     getitem_idx.can_only_throw = [IndexError]
 
-    getitem_idx_key = getitem_idx
-
     def mul((str1, int2)): # xxx do we want to support this
         return SomeUnicodeString()
 
diff --git a/rpython/doc/jit/index.rst b/rpython/doc/jit/index.rst
--- a/rpython/doc/jit/index.rst
+++ b/rpython/doc/jit/index.rst
@@ -23,11 +23,15 @@
 
    overview
    pyjitpl5
+   optimizer
    virtualizable
 
 - :doc:`Overview <overview>`: motivating our approach
 
 - :doc:`Notes <pyjitpl5>` about the current work in PyPy
 
+- :doc:`Optimizer <optimizer>`: the step between tracing and writing
+  machine code
+
 - :doc:`Virtulizable <virtualizable>` how virtualizables work and what they are
   (in other words how to make frames more efficient).
diff --git a/rpython/doc/jit/optimizer.rst b/rpython/doc/jit/optimizer.rst
new file mode 100644
--- /dev/null
+++ b/rpython/doc/jit/optimizer.rst
@@ -0,0 +1,196 @@
+.. _trace_optimizer:
+
+Trace Optimizer
+===============
+
+Traces of user programs are not directly translated into machine code.
+The optimizer module implements several different semantics-preserving
+transformations that either allow operations to be swept from the trace
+or convert them to operations that need less time or space.
+
+The optimizer is in `rpython/jit/metainterp/optimizeopt/`.
+When you try to make sense of this module, this page might get you started.
+
+Before some optimizations are explained in more detail, it is essential to
+understand what traces look like.
+The optimizer comes with a test suite. It contains many trace
+examples and you might want to take a look at it
+(in `rpython/jit/metainterp/optimizeopt/test/*.py`).
+The allowed operations can be found in `rpython/jit/metainterp/resoperation.py`.
+Here is an example of a trace::
+
+    [p0,i0,i1]
+    label(p0, i0, i1)
+    i2 = getarrayitem_raw(p0, i0, descr=<Array Signed>)
+    i3 = int_add(i1,i2)
+    i4 = int_add(i0,1)
+    i5 = int_le(i4, 100) # lower-or-equal
+    guard_true(i5)
+    jump(p0, i4, i3)
+
+At the beginning it might be clumsy to read, but it makes sense when you
+compare it with the Python code that constructed the trace::
+
+    from array import array
+    a = array('i',range(101))
+    sum = 0; i = 0
+    while i <= 100: # can be seen as label
+        sum += a[i]
+        i += 1
+        # jumps back to the while header
+
+There are better ways to compute the sum of ``[0..100]``, but it gives a better intuition on how
+traces are constructed than ``sum(range(101))``.
+Note that the trace syntax is the one used in the test suite. It is also very
+similar to traces printed at runtime by PYPYLOG_. The first line gives the input variables, the
+second line is a ``label`` operation, the last one is the backwards ``jump`` operation.
+
+.. _PYPYLOG: logging.html
+
+The instructions mentioned earlier are special:
+
+* the input defines the input parameter type and name to enter the trace.
+* ``label`` is the instruction a ``jump`` can target. Label instructions have
+  an associated ``JitCellToken`` that uniquely identifies the label. Any
+  jump targets the token of some label.
+
+The token is saved in a so-called `descriptor` of the instruction.  It is
+not written explicitly above because the tests do not write it either.  But
+the test suite creates a dummy token for each trace and adds it as descriptor
+to ``label`` and ``jump``. Of course the optimizer does the same at runtime,
+but using real values.
+The sample trace includes a descriptor in ``getarrayitem_raw``. Here it
+annotates the type of the array. It is a signed integer array.
+
+High level overview
+-------------------
+
+Before the JIT backend transforms any trace into machine code, it tries to
+transform the trace into an equivalent trace that executes faster. The method
+`optimize_trace` in `rpython/jit/metainterp/optimizeopt/__init__.py` is the
+main entry point.
+
+Optimizations are applied in a sequence one after another and the base
+sequence is as follows::
+
+    intbounds:rewrite:virtualize:string:earlyforce:pure:heap:unroll
+
+Each of the colon-separated names has a class attached, inheriting from
+the `Optimization` class.  The `Optimizer` class itself also
+derives from the `Optimization` class and implements the control logic for
+the optimization. Most of the optimizations only require a single forward pass.
+The trace is 'propagated' into each optimization using the method
+`propagate_forward`. Instruction by instruction, it flows from the
+first optimization to the last optimization. The method `emit_operation`
+is called for every operation that is passed to the next optimizer.
+
+A frequently encountered pattern
+--------------------------------
+
+To find potential optimization targets it is necessary to know the instruction
+type. A simple solution is to switch on the operation number (= type)::
+
+    for op in operations:
+        if op.getopnum() == rop.INT_ADD:
+            # handle this instruction
+            pass
+        elif op.getopnum() == rop.INT_FLOOR_DIV:
+            pass
+        # and many more
+
+Things get worse if you start to match the arguments
+(is the first argument a constant and the second a variable, or vice
+versa?). The pattern to tackle
+this code bloat is to move it to a separate method using
+`make_dispatcher_method`. It associates methods with instruction types::
+
+    class OptX(Optimization):
+        def prefix_INT_ADD(self, op):
+            pass # emit, transform, ...
+
+    dispatch_opt = make_dispatcher_method(OptX, 'prefix_',
+                                          default=OptX.emit_operation)
+    OptX.propagate_forward = dispatch_opt
+
+    optX = OptX()
+    for op in operations:
+        optX.propagate_forward(op)
+
+``propagate_forward`` searches for the method that is able to handle the instruction
+type. As an example, `INT_ADD` will invoke `prefix_INT_ADD`. If there is no function
+for the instruction, it is routed to the default implementation (``emit_operation``
+in this example).
+
+Rewrite optimization
+--------------------
+
+The second optimization is called 'rewrite' and is commonly also known as
+strength reduction. A simple example would be that an integer multiplied
+by 2 is equivalent to the bits shifted to the left once
+(e.g. ``x * 2 == x << 1``). This optimization performs not only strength
+reduction but also boolean and arithmetic simplifications. Other examples
+are ``x & 0 == 0`` and ``x - 0 == x``.
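+
+In trace form, such a rewrite looks like this (schematic, not actual
+optimizer output)::
+
+    i2 = int_mul(i1, 2)       # before
+    i2 = int_lshift(i1, 1)    # after: same result, cheaper operation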
+
+Whenever such an operation is encountered (e.g. ``y = x & 0``), no operation is
+emitted. Instead the variable y is made equal to 0
+(= ``make_equal_to(op.result, 0)``). The variables found in a trace are
+instances of Box classes that can be found in
+`rpython/jit/metainterp/history.py`. `OptValue` wraps those variables again
+and maps the boxes to the optimization values in the optimizer. When a
+value is made equal, the two variables' boxes are made to point to the same
+`OptValue` instance.
+
+**NOTE: this OptValue organization is currently being refactored in a branch.**
+
+Pure optimization
+-----------------
+
+This optimization is interwoven into the basic optimizer. It records
+operations, results and arguments that are known to have pure semantics.
+
+"Pure" here means the same as the ``jit.elidable`` decorator:
+free of "observable" side effects and referentially transparent
+(the operation can be replaced with its result without changing the program
+semantics). The operations marked as ALWAYS_PURE in `resoperation.py` are a
+subset of the NOSIDEEFFECT operations. Operations such as new, new array,
+getfield_(raw/gc) are marked as NOSIDEEFFECT but not as ALWAYS_PURE.
+
+Pure operations are optimized in two different ways.  If their arguments
+are constants, the operation is removed and the result is turned into a
+constant.  If not, we can still use a memoization technique: if, later,
+we see the same operation on the same arguments again, we don't need to
+recompute its result, but can simply reuse the previous operation's
+result.
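+
+Schematically (an illustrative trace, not actual optimizer output)::
+
+    i3 = int_add(i1, i2)
+    ...
+    i5 = int_add(i1, i2)    # same pure operation, same arguments:
+                            # i5 is simply made equal to i3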
+
+Unroll optimization
+-------------------
+
+A detailed description can be found in the document
+`Loop-Aware Optimizations in PyPy's Tracing JIT`__.
+
+.. __: http://www2.maths.lth.se/matematiklth/vision/publdb/reports/pdf/ardo-bolz-etal-dls-12.pdf
+
+This optimization does not fall into the traditional scheme of one forward
+pass only. In a nutshell it unrolls the trace *once*, connects the two
+traces (by inserting parameters into the jump and label of the peeled trace)
+and uses information to iron out allocations, propagate constants and
+do any other optimization currently present in the 'optimizeopt' module.
+
+It is prepended to all optimizations and thus extends the Optimizer class
+and unrolls the loop once before it proceeds.
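+
+Schematically, a single loop trace becomes a preamble plus a peeled
+loop (the labels and operations are illustrative only)::
+
+    label(p0, i0, descr=L0)     # preamble (first iteration)
+    i1 = int_add(i0, 1)
+    jump(p0, i1, descr=L1)
+
+    label(p0, i1, descr=L1)     # peeled loop, optimized using facts
+    i2 = int_add(i1, 1)         # established in the preamble
+    jump(p0, i2, descr=L1)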
+
+
+What is missing from this document
+----------------------------------
+
+* Guards are not explained
+* Several optimizations are not explained
+
+
+Further references
+------------------
+
+* `Allocation Removal by Partial Evaluation in a Tracing JIT`__
+* `Loop-Aware Optimizations in PyPy's Tracing JIT`__
+
+.. __: http://www.stups.uni-duesseldorf.de/mediawiki/images/b/b0/Pub-BoCuFiLePeRi2011.pdf
+.. __: http://www2.maths.lth.se/matematiklth/vision/publdb/reports/pdf/ardo-bolz-etal-dls-12.pdf
diff --git a/rpython/flowspace/model.py b/rpython/flowspace/model.py
--- a/rpython/flowspace/model.py
+++ b/rpython/flowspace/model.py
@@ -140,6 +140,12 @@
             newlink.llexitcase = self.llexitcase
         return newlink
 
+    def replace(self, mapping):
+        def rename(v):
+            if v is not None:
+                return v.replace(mapping)
+        return self.copy(rename)
+
     def settarget(self, targetblock):
         assert len(self.args) == len(targetblock.inputargs), (
             "output args mismatch")
@@ -215,13 +221,12 @@
         return uniqueitems([w for w in result if isinstance(w, Constant)])
 
     def renamevariables(self, mapping):
-        self.inputargs = [mapping.get(a, a) for a in self.inputargs]
-        for op in self.operations:
-            op.args = [mapping.get(a, a) for a in op.args]
-            op.result = mapping.get(op.result, op.result)
-        self.exitswitch = mapping.get(self.exitswitch, self.exitswitch)
+        self.inputargs = [a.replace(mapping) for a in self.inputargs]
+        self.operations = [op.replace(mapping) for op in self.operations]
+        if self.exitswitch is not None:
+            self.exitswitch = self.exitswitch.replace(mapping)
         for link in self.exits:
-            link.args = [mapping.get(a, a) for a in link.args]
+            link.args = [a.replace(mapping) for a in link.args]
 
     def closeblock(self, *exits):
         assert self.exits == [], "block already closed"
@@ -327,6 +332,8 @@
             newvar.concretetype = self.concretetype
         return newvar
 
+    def replace(self, mapping):
+        return mapping.get(self, self)
 
 
 class Constant(Hashable):
@@ -356,6 +363,9 @@
             # cannot count on it not mutating at runtime!
             return False
 
+    def replace(self, mapping):
+        return self
+
 
 class FSException(object):
     def __init__(self, w_type, w_value):
@@ -431,8 +441,8 @@
                                 ", ".join(map(repr, self.args)))
 
     def replace(self, mapping):
-        newargs = [mapping.get(arg, arg) for arg in self.args]
-        newresult = mapping.get(self.result, self.result)
+        newargs = [arg.replace(mapping) for arg in self.args]
+        newresult = self.result.replace(mapping)
         return type(self)(self.opname, newargs, newresult, self.offset)
 
 class Atom(object):
diff --git a/rpython/flowspace/operation.py b/rpython/flowspace/operation.py
--- a/rpython/flowspace/operation.py
+++ b/rpython/flowspace/operation.py
@@ -76,8 +76,8 @@
         self.offset = -1
 
     def replace(self, mapping):
-        newargs = [mapping.get(arg, arg) for arg in self.args]
-        newresult = mapping.get(self.result, self.result)
+        newargs = [arg.replace(mapping) for arg in self.args]
+        newresult = self.result.replace(mapping)
         newop = type(self)(*newargs)
         newop.result = newresult
         newop.offset = self.offset
@@ -422,8 +422,6 @@
 add_operator('delattr', 2, dispatch=1, pyfunc=delattr)
 add_operator('getitem', 2, dispatch=2, pure=True)
 add_operator('getitem_idx', 2, dispatch=2, pure=True)
-add_operator('getitem_key', 2, dispatch=2, pure=True)
-add_operator('getitem_idx_key', 2, dispatch=2, pure=True)
 add_operator('setitem', 3, dispatch=2)
 add_operator('delitem', 2, dispatch=2)
 add_operator('getslice', 3, dispatch=1, pyfunc=do_getslice, pure=True)
@@ -686,8 +684,6 @@
 # the annotator tests
 op.getitem.canraise = [IndexError, KeyError, Exception]
 op.getitem_idx.canraise = [IndexError, KeyError, Exception]
-op.getitem_key.canraise = [IndexError, KeyError, Exception]
-op.getitem_idx_key.canraise = [IndexError, KeyError, Exception]
 op.setitem.canraise = [IndexError, KeyError, Exception]
 op.delitem.canraise = [IndexError, KeyError, Exception]
 op.contains.canraise = [Exception]    # from an r_dict
diff --git a/rpython/flowspace/test/test_objspace.py b/rpython/flowspace/test/test_objspace.py
--- a/rpython/flowspace/test/test_objspace.py
+++ b/rpython/flowspace/test/test_objspace.py
@@ -867,7 +867,7 @@
                 raise
         graph = self.codetest(f)
         simplify_graph(graph)
-        assert self.all_operations(graph) == {'getitem_idx_key': 1}
+        assert self.all_operations(graph) == {'getitem_idx': 1}
 
         g = lambda: None
         def f(c, x):
@@ -877,7 +877,7 @@
                 g()
         graph = self.codetest(f)
         simplify_graph(graph)
-        assert self.all_operations(graph) == {'getitem_idx_key': 1,
+        assert self.all_operations(graph) == {'getitem_idx': 1,
                                               'simple_call': 2}
 
         def f(c, x):
@@ -896,7 +896,7 @@
                 raise
         graph = self.codetest(f)
         simplify_graph(graph)
-        assert self.all_operations(graph) == {'getitem_key': 1}
+        assert self.all_operations(graph) == {'getitem': 1}
 
         def f(c, x):
             try:
@@ -915,7 +915,7 @@
         graph = self.codetest(f)
         simplify_graph(graph)
         self.show(graph)
-        assert self.all_operations(graph) == {'getitem_idx_key': 1}
+        assert self.all_operations(graph) == {'getitem_idx': 1}
 
         def f(c, x):
             try:
@@ -933,7 +933,7 @@
                 return -1
         graph = self.codetest(f)
         simplify_graph(graph)
-        assert self.all_operations(graph) == {'getitem_key': 1}
+        assert self.all_operations(graph) == {'getitem': 1}
 
         def f(c, x):
             try:
diff --git a/rpython/rlib/_rsocket_rffi.py b/rpython/rlib/_rsocket_rffi.py
--- a/rpython/rlib/_rsocket_rffi.py
+++ b/rpython/rlib/_rsocket_rffi.py
@@ -199,7 +199,7 @@
 WSA_INVALID_PARAMETER WSA_NOT_ENOUGH_MEMORY WSA_OPERATION_ABORTED
 SIO_RCVALL SIO_KEEPALIVE_VALS
 
-SIOCGIFNAME
+SIOCGIFNAME SIOCGIFINDEX
 '''.split()
 
 for name in constant_names:
@@ -328,7 +328,8 @@
 
     if _HAS_AF_PACKET:
         CConfig.sockaddr_ll = platform.Struct('struct sockaddr_ll',
-                              [('sll_ifindex', rffi.INT),
+                              [('sll_family', rffi.INT),
+                               ('sll_ifindex', rffi.INT),
                                ('sll_protocol', rffi.INT),
                                ('sll_pkttype', rffi.INT),
                                ('sll_hatype', rffi.INT),
diff --git a/rpython/rlib/rsocket.py b/rpython/rlib/rsocket.py
--- a/rpython/rlib/rsocket.py
+++ b/rpython/rlib/rsocket.py
@@ -5,15 +5,8 @@
 a drop-in replacement for the 'socket' module.
 """
 
-# Known missing features:
-#
-#   - address families other than AF_INET, AF_INET6, AF_UNIX, AF_PACKET
-#   - AF_PACKET is only supported on Linux
-#   - methods makefile(),
-#   - SSL
-#
-# It's unclear if makefile() and SSL support belong here or only as
-# app-level code for PyPy.
+# XXX this does not support yet the least common AF_xxx address families
+# supported by CPython.  See http://bugs.pypy.org/issue1942
 
 from rpython.rlib import _rsocket_rffi as _c, jit, rgc
 from rpython.rlib.objectmodel import instantiate, keepalive_until_here
@@ -200,23 +193,49 @@
         family = AF_PACKET
         struct = _c.sockaddr_ll
         maxlen = minlen = sizeof(struct)
+        ifr_name_size = _c.ifreq.c_ifr_name.length
+        sll_addr_size = _c.sockaddr_ll.c_sll_addr.length
+
+        def __init__(self, ifindex, protocol, pkttype=0, hatype=0, haddr=""):
+            addr = lltype.malloc(_c.sockaddr_ll, flavor='raw', zero=True,
+                                 track_allocation=False)
+            self.setdata(addr, PacketAddress.maxlen)
+            rffi.setintfield(addr, 'c_sll_family', AF_PACKET)
+            rffi.setintfield(addr, 'c_sll_protocol', htons(protocol))
+            rffi.setintfield(addr, 'c_sll_ifindex', ifindex)
+            rffi.setintfield(addr, 'c_sll_pkttype', pkttype)
+            rffi.setintfield(addr, 'c_sll_hatype', hatype)
+            halen = rffi.str2chararray(haddr,
+                                       rffi.cast(rffi.CCHARP, addr.c_sll_addr),
+                                       PacketAddress.sll_addr_size)
+            rffi.setintfield(addr, 'c_sll_halen', halen)
+
+        @staticmethod
+        def get_ifindex_from_ifname(fd, ifname):
+            p = lltype.malloc(_c.ifreq, flavor='raw')
+            iflen = rffi.str2chararray(ifname,
+                                       rffi.cast(rffi.CCHARP, p.c_ifr_name),
+                                       PacketAddress.ifr_name_size - 1)
+            p.c_ifr_name[iflen] = '\0'
+            err = _c.ioctl(fd, _c.SIOCGIFINDEX, p)
+            ifindex = p.c_ifr_ifindex
+            lltype.free(p, flavor='raw')
+            if err != 0:
+                raise RSocketError("invalid interface name")
+            return ifindex
 
         def get_ifname(self, fd):
+            ifname = ""
             a = self.lock(_c.sockaddr_ll)
-            p = lltype.malloc(_c.ifreq, flavor='raw')
-            rffi.setintfield(p, 'c_ifr_ifindex',
-                             rffi.getintfield(a, 'c_sll_ifindex'))
-            if (_c.ioctl(fd, _c.SIOCGIFNAME, p) == 0):
-                # eh, the iface name is a constant length array
-                i = 0
-                d = []
-                while p.c_ifr_name[i] != '\x00' and i < len(p.c_ifr_name):
-                    d.append(p.c_ifr_name[i])
-                    i += 1
-                ifname = ''.join(d)
-            else:
-                ifname = ""
-            lltype.free(p, flavor='raw')
+            ifindex = rffi.getintfield(a, 'c_sll_ifindex')
+            if ifindex:
+                p = lltype.malloc(_c.ifreq, flavor='raw')
+                rffi.setintfield(p, 'c_ifr_ifindex', ifindex)
+                if (_c.ioctl(fd, _c.SIOCGIFNAME, p) == 0):
+                    ifname = rffi.charp2strn(
+                        rffi.cast(rffi.CCHARP, p.c_ifr_name),
+                        PacketAddress.ifr_name_size)
+                lltype.free(p, flavor='raw')
             self.unlock()
             return ifname
 
@@ -235,11 +254,11 @@
 
         def get_hatype(self):
             a = self.lock(_c.sockaddr_ll)
-            res = bool(rffi.getintfield(a, 'c_sll_hatype'))
+            res = rffi.getintfield(a, 'c_sll_hatype')
             self.unlock()
             return res
 
-        def get_addr(self):
+        def get_haddr(self):
             a = self.lock(_c.sockaddr_ll)
             lgt = rffi.getintfield(a, 'c_sll_halen')
             d = []
diff --git a/rpython/rlib/rzipfile.py b/rpython/rlib/rzipfile.py
--- a/rpython/rlib/rzipfile.py
+++ b/rpython/rlib/rzipfile.py
@@ -8,7 +8,7 @@
 
 try:
     from rpython.rlib import rzlib
-except (ImportError, CompilationError):
+except CompilationError:
     rzlib = None
 
 crc_32_tab = [
diff --git a/rpython/rlib/rzlib.py b/rpython/rlib/rzlib.py
--- a/rpython/rlib/rzlib.py
+++ b/rpython/rlib/rzlib.py
@@ -22,13 +22,10 @@
         includes=['zlib.h'],
         testonly_libraries = testonly_libraries
     )
-try:
-    eci = rffi_platform.configure_external_library(
-        libname, eci,
-        [dict(prefix='zlib-'),
-         ])
-except CompilationError:
-    raise ImportError("Could not find a zlib library")
+eci = rffi_platform.configure_external_library(
+    libname, eci,
+    [dict(prefix='zlib-'),
+     ])
 
 
 constantnames = '''
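
rzlib now lets the CompilationError raised by configure_external_library
propagate instead of converting it into an ImportError, so importers that
want to degrade gracefully catch it explicitly, as the rzipfile hunk above
does.  The guarded-import pattern, sketched:

    from rpython.translator.platform import CompilationError
    try:
        from rpython.rlib import rzlib
    except CompilationError:
        rzlib = None    # no usable zlib; callers must check for None
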
diff --git a/rpython/rlib/test/test_rzipfile.py b/rpython/rlib/test/test_rzipfile.py
--- a/rpython/rlib/test/test_rzipfile.py
+++ b/rpython/rlib/test/test_rzipfile.py
@@ -9,7 +9,7 @@
 
 try:
     from rpython.rlib import rzlib
-except ImportError, e:
+except CompilationError as e:
     py.test.skip("zlib not installed: %s " % (e, ))
 
 class BaseTestRZipFile(BaseRtypingTest):
diff --git a/rpython/rtyper/lltypesystem/rffi.py b/rpython/rtyper/lltypesystem/rffi.py
--- a/rpython/rtyper/lltypesystem/rffi.py
+++ b/rpython/rtyper/lltypesystem/rffi.py
@@ -794,6 +794,14 @@
         else:
             lltype.free(cp, flavor='raw', track_allocation=False)
 
+    # str -> already-existing char[maxsize]
+    def str2chararray(s, array, maxsize):
+        length = min(len(s), maxsize)
+        ll_s = llstrtype(s)
+        copy_string_to_raw(ll_s, array, 0, length)
+        return length
+    str2chararray._annenforceargs_ = [strtype, None, int]
+
     # char* -> str
     # doesn't free char*
     def charp2str(cp):
@@ -944,19 +952,19 @@
     return (str2charp, free_charp, charp2str,
             get_nonmovingbuffer, free_nonmovingbuffer,
             alloc_buffer, str_from_buffer, keep_buffer_alive_until_here,
-            charp2strn, charpsize2str,
+            charp2strn, charpsize2str, str2chararray,
             )
 
 (str2charp, free_charp, charp2str,
  get_nonmovingbuffer, free_nonmovingbuffer,
  alloc_buffer, str_from_buffer, keep_buffer_alive_until_here,
- charp2strn, charpsize2str,
+ charp2strn, charpsize2str, str2chararray,
  ) = make_string_mappings(str)
 
 (unicode2wcharp, free_wcharp, wcharp2unicode,
  get_nonmoving_unicodebuffer, free_nonmoving_unicodebuffer,
  alloc_unicodebuffer, unicode_from_buffer, keep_unicodebuffer_alive_until_here,
- wcharp2unicoden, wcharpsize2unicode,
+ wcharp2unicoden, wcharpsize2unicode, unicode2wchararray,
  ) = make_string_mappings(unicode)
 
 # char**
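
The new str2chararray (and its unicode twin unicode2wchararray, generated by
the same mapping) copies at most maxsize characters of a string into an
already-allocated char array and returns the number copied; unlike str2charp
it neither allocates nor NUL-terminates, which is why get_ifindex_from_ifname
above writes the terminator itself.  A usage sketch:

    from rpython.rtyper.lltypesystem import rffi, lltype

    buf = lltype.malloc(rffi.CCHARP.TO, 8, flavor='raw')
    try:
        n = rffi.str2chararray("abcdef", buf, 8 - 1)   # keep room for a NUL
        buf[n] = '\x00'
    finally:
        lltype.free(buf, flavor='raw')
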
diff --git a/rpython/rtyper/lltypesystem/test/test_rffi.py b/rpython/rtyper/lltypesystem/test/test_rffi.py
--- a/rpython/rtyper/lltypesystem/test/test_rffi.py
+++ b/rpython/rtyper/lltypesystem/test/test_rffi.py
@@ -102,24 +102,19 @@
         #include <string.h>
         #include <src/mem.h>
 
-        char *f(char* arg)
+        void f(char *target, char* arg)
         {
-            char *ret;
-            /* lltype.free uses OP_RAW_FREE, we must allocate
-             * with the matching function
-             */
-            OP_RAW_MALLOC(strlen(arg) + 1, ret, char*)
-            strcpy(ret, arg);
-            return ret;
+            strcpy(target, arg);
         }
         """)
         eci = ExternalCompilationInfo(separate_module_sources=[c_source],
-                                      post_include_bits=['char *f(char*);'])
-        z = llexternal('f', [CCHARP], CCHARP, compilation_info=eci)
+                                      post_include_bits=['void f(char*,char*);'])
+        z = llexternal('f', [CCHARP, CCHARP], lltype.Void, compilation_info=eci)
 
         def f():
             s = str2charp("xxx")
-            l_res = z(s)
+            l_res = lltype.malloc(CCHARP.TO, 10, flavor='raw')
+            z(l_res, s)
             res = charp2str(l_res)
             lltype.free(l_res, flavor='raw')
             free_charp(s)
@@ -676,6 +671,23 @@
 
         assert interpret(f, [], backendopt=True) == 43
 
+    def test_str2chararray(self):
+        eci = ExternalCompilationInfo(includes=['string.h'])
+        strlen = llexternal('strlen', [CCHARP], SIZE_T,
+                            compilation_info=eci)
+        def f():
+            raw = str2charp("XxxZy")
+            n = str2chararray("abcdef", raw, 4)
+            assert raw[0] == 'a'
+            assert raw[1] == 'b'
+            assert raw[2] == 'c'
+            assert raw[3] == 'd'
+            assert raw[4] == 'y'
+            lltype.free(raw, flavor='raw')
+            return n
+
+        assert interpret(f, []) == 4
+
     def test_around_extcall(self):
         if sys.platform == "win32":
             py.test.skip('No pipes on windows')
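
The rewritten C helper no longer allocates: memory obtained with
OP_RAW_MALLOC on the C side must be released by the matching raw-free, so
having RPython own the buffer keeps allocation and deallocation on one side
of the FFI boundary.  The pattern in miniature (a sketch, not the test
itself):

    # caller-owned buffer: RPython allocates, C only fills it in
    buf = lltype.malloc(rffi.CCHARP.TO, 10, flavor='raw')
    z(buf, s)                         # C-side strcpy() into our buffer
    res = rffi.charp2str(buf)
    lltype.free(buf, flavor='raw')    # freed by the allocator that made it
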
diff --git a/rpython/rtyper/rlist.py b/rpython/rtyper/rlist.py
--- a/rpython/rtyper/rlist.py
+++ b/rpython/rtyper/rlist.py
@@ -268,13 +268,9 @@
         v_res = hop.gendirectcall(llfn, c_func_marker, c_basegetitem, v_lst, v_index)
         return r_lst.recast(hop.llops, v_res)
 
-    rtype_getitem_key = rtype_getitem
-
     def rtype_getitem_idx((r_lst, r_int), hop):
         return pair(r_lst, r_int).rtype_getitem(hop, checkidx=True)
 
-    rtype_getitem_idx_key = rtype_getitem_idx
-
     def rtype_setitem((r_lst, r_int), hop):
         if hop.has_implicit_exception(IndexError):
             spec = dum_checkidx
diff --git a/rpython/rtyper/rmodel.py b/rpython/rtyper/rmodel.py
--- a/rpython/rtyper/rmodel.py
+++ b/rpython/rtyper/rmodel.py
@@ -285,11 +285,9 @@
 
     # default implementation for checked getitems
 
-    def rtype_getitem_idx_key((r_c1, r_o1), hop):
+    def rtype_getitem_idx((r_c1, r_o1), hop):
         return pair(r_c1, r_o1).rtype_getitem(hop)
 
-    rtype_getitem_idx = rtype_getitem_idx_key
-    rtype_getitem_key = rtype_getitem_idx_key
 
 # ____________________________________________________________
 
diff --git a/rpython/rtyper/rstr.py b/rpython/rtyper/rstr.py
--- a/rpython/rtyper/rstr.py
+++ b/rpython/rtyper/rstr.py
@@ -580,13 +580,9 @@
             hop.exception_cannot_occur()
         return hop.gendirectcall(llfn, v_str, v_index)
 
-    rtype_getitem_key = rtype_getitem
-
     def rtype_getitem_idx((r_str, r_int), hop):
         return pair(r_str, r_int).rtype_getitem(hop, checkidx=True)
 
-    rtype_getitem_idx_key = rtype_getitem_idx
-
     def rtype_mul((r_str, r_int), hop):
         str_repr = r_str.repr
         v_str, v_int = hop.inputargs(str_repr, Signed)
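
This hunk and the rlist/rmodel ones above drop the *_key and *_idx_key
aliases: the 'key' postfix marked a KeyError-guarded getitem, which the
simplifier (see simplify.py below) no longer attaches for these types.  What
survives is the plain operation plus an IndexError-checked twin, roughly:

    # sketch with a stand-in class, not the real rtyper pair
    class Pair(object):
        def rtype_getitem(self, hop, checkidx=False):
            return ('getitem', checkidx)    # stands in for code emission
        def rtype_getitem_idx(self, hop):
            return self.rtype_getitem(hop, checkidx=True)

    print Pair().rtype_getitem_idx(None)    # ('getitem', True)
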
diff --git a/rpython/translator/platform/arm.py b/rpython/translator/platform/arm.py
--- a/rpython/translator/platform/arm.py
+++ b/rpython/translator/platform/arm.py
@@ -21,11 +21,14 @@
 
     available_librarydirs = [SB2 + '/lib/arm-linux-gnueabi/',
                              SB2 + '/lib/arm-linux-gnueabihf/',
+                             SB2 + '/lib/aarch64-linux-gnu/',
                              SB2 + '/usr/lib/arm-linux-gnueabi/',
-                             SB2 + '/usr/lib/arm-linux-gnueabihf/']
+                             SB2 + '/usr/lib/arm-linux-gnueabihf/',
+                             SB2 + '/usr/lib/aarch64-linux-gnu/']
 
     available_includedirs = [SB2 + '/usr/include/arm-linux-gnueabi/',
-                             SB2 + '/usr/include/arm-linux-gnueabihf/']
+                             SB2 + '/usr/include/arm-linux-gnueabihf/',
+                             SB2 + '/usr/include/aarch64-linux-gnu/']
     copied_cache = {}
 
 
diff --git a/rpython/translator/platform/posix.py b/rpython/translator/platform/posix.py
--- a/rpython/translator/platform/posix.py
+++ b/rpython/translator/platform/posix.py
@@ -167,7 +167,8 @@
             ('CC', self.cc),
             ('CC_LINK', eci.use_cpp_linker and 'g++' or '$(CC)'),
             ('LINKFILES', eci.link_files),
-            ('RPATH_FLAGS', self.rpath_flags),
+            ('RPATH_FLAGS', self.rpath_flags + ['-Wl,-rpath-link=\'%s\'' % ldir
+                                                for ldir in rel_libdirs]),
             ]
         for args in definitions:
             m.definition(*args)
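
-Wl,-rpath-link points the GNU linker at directories to search for
transitive shared-library dependencies at link time only, whereas -rpath is
also recorded for run time; emitting one per relative libdir lets the link
succeed against libraries shipped next to the binary rather than installed
system-wide.  For illustration, with a hypothetical rel_libdirs of ['../lib']
the definition expands to something like:

    rpath_flags = ['-Wl,-rpath=$ORIGIN/']    # hypothetical base flags
    rel_libdirs = ['../lib']
    flags = rpath_flags + ["-Wl,-rpath-link='%s'" % ldir
                           for ldir in rel_libdirs]
    print flags   # ['-Wl,-rpath=$ORIGIN/', "-Wl,-rpath-link='../lib'"]
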
diff --git a/rpython/translator/simplify.py b/rpython/translator/simplify.py
--- a/rpython/translator/simplify.py
+++ b/rpython/translator/simplify.py
@@ -63,28 +63,19 @@
     When this happens, we need to replace the preceding link with the
     following link.  Arguments of the links should be updated."""
     for link in list(graph.iterlinks()):
-            while not link.target.operations:
-                block1 = link.target
-                if block1.exitswitch is not None:
-                    break
-                if not block1.exits:
-                    break
-                exit = block1.exits[0]
-                assert block1 is not exit.target, (
-                    "the graph contains an empty infinite loop")
-                outputargs = []
-                for v in exit.args:
-                    if isinstance(v, Variable):
-                        try:
-                            i = block1.inputargs.index(v)
-                            v = link.args[i]
-                        except ValueError:
-                            # the variable was passed implicitly to block1
-                            pass
-                    outputargs.append(v)
-                link.args = outputargs
-                link.target = exit.target
-                # the while loop above will simplify recursively the new link
+        while not link.target.operations:
+            block1 = link.target
+            if block1.exitswitch is not None:
+                break
+            if not block1.exits:
+                break
+            exit = block1.exits[0]
+            assert block1 is not exit.target, (
+                "the graph contains an empty infinite loop")
+            subst = dict(zip(block1.inputargs, link.args))
+            link.args = [v.replace(subst) for v in exit.args]
+            link.target = exit.target
+            # the while loop above will recursively simplify the new link
 
 def transform_ovfcheck(graph):
     """The special function calls ovfcheck needs to
@@ -132,9 +123,6 @@
     the block's single list of exits.
     """
     renaming = {}
-    def rename(v):
-        return renaming.get(v, v)
-
     for block in graph.iterblocks():
         if not (block.canraise
                 and block.exits[-1].exitcase is Exception):
@@ -151,10 +139,10 @@
         while len(query.exits) == 2:
             newrenaming = {}
             for lprev, ltarg in zip(exc.args, query.inputargs):
-                newrenaming[ltarg] = rename(lprev)
+                newrenaming[ltarg] = lprev.replace(renaming)
             op = query.operations[0]
             if not (op.opname in ("is_", "issubtype") and
-                    newrenaming.get(op.args[0]) == last_exception):
+                    op.args[0].replace(newrenaming) == last_exception):
                 break
             renaming.update(newrenaming)
             case = query.operations[0].args[-1].value
@@ -177,19 +165,16 @@
         # construct the block's new exits
         exits = []
         for case, oldlink in switches:
-            link = oldlink.copy(rename)
+            link = oldlink.replace(renaming)
             assert case is not None
             link.last_exception = last_exception
             link.last_exc_value = last_exc_value
             # make the above two variables unique
             renaming2 = {}
-            def rename2(v):
-                return renaming2.get(v, v)
             for v in link.getextravars():
                 renaming2[v] = Variable(v)
-            link = link.copy(rename2)
+            link = link.replace(renaming2)
             link.exitcase = case
-            link.prevblock = block
             exits.append(link)
         block.recloseblock(*(preserve + exits))
 
@@ -203,8 +188,6 @@
                 for exit in block.exits:
                     if exit.exitcase is IndexError:
                         postfx.append('idx')
-                    elif exit.exitcase is KeyError:
-                        postfx.append('key')
                 if postfx:
                     Op = getattr(op, '_'.join(['getitem'] + postfx))
                     newop = Op(*last_op.args)
@@ -313,8 +296,6 @@
             renaming = {}
             for vprev, vtarg in zip(link.args, link.target.inputargs):
                 renaming[vtarg] = vprev
-            def rename(v):
-                return renaming.get(v, v)
             def rename_op(op):
                 op = op.replace(renaming)
                 # special case...
@@ -328,9 +309,12 @@
                 link.prevblock.operations.append(rename_op(op))
             exits = []
             for exit in link.target.exits:
-                newexit = exit.copy(rename)
+                newexit = exit.replace(renaming)
                 exits.append(newexit)
-            newexitswitch = rename(link.target.exitswitch)
+            if link.target.exitswitch:
+                newexitswitch = link.target.exitswitch.replace(renaming)
+            else:
+                newexitswitch = None
             link.prevblock.exitswitch = newexitswitch
             link.prevblock.recloseblock(*exits)
             if (isinstance(newexitswitch, Constant) and
@@ -618,7 +602,7 @@
         assert len(links) == len(new_args)
         for link, args in zip(links, new_args):
             link.args = args
-    for block in graph.iterblocks():
+    for block in entrymap:
         block.renamevariables(renaming)
 
 def remove_identical_vars(graph):
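
The bulk of these simplify.py hunks replaces ad-hoc rename() closures with
the .replace(mapping) API on flow-graph objects: a dict maps old variables to
their substitutes and lookup falls back to identity, exactly what the old
renaming.get(v, v) closures spelled out by hand.  A self-contained sketch of
the idiom, using a toy class rather than the real flowspace Variable:

    class Var(object):
        def __init__(self, name):
            self.name = name
        def replace(self, mapping):
            return mapping.get(self, self)   # identity when unmapped

    a, b, c, x = Var('a'), Var('b'), Var('c'), Var('x')
    subst = dict(zip([a, b], [x, x]))        # inputargs -> link.args
    print [v.replace(subst).name for v in [a, b, c]]   # ['x', 'x', 'c']
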

