[pypy-commit] pypy release-2.5.x: merge default into branch

arigo noreply at buildbot.pypy.org
Sat Mar 14 14:50:52 CET 2015


Author: Armin Rigo <arigo at tunes.org>
Branch: release-2.5.x
Changeset: r76367:976b02d5d893
Date: 2015-03-14 14:50 +0100
http://bitbucket.org/pypy/pypy/changeset/976b02d5d893/

Log:	merge default into branch

diff --git a/pypy/doc/stm.rst b/pypy/doc/stm.rst
--- a/pypy/doc/stm.rst
+++ b/pypy/doc/stm.rst
@@ -45,11 +45,10 @@
   it as a drop-in replacement and multithreaded programs will run on
   multiple cores.
 
-* ``pypy-stm`` does not impose any special API to the user, but it
-  provides a new pure Python module called `transactional_memory`_ with
-  features to inspect the state or debug conflicts_ that prevent
-  parallelization.  This module can also be imported on top of a non-STM
-  PyPy or CPython.
+* ``pypy-stm`` provides (but does not impose) a special API to the
+  user in the pure Python module `transaction`_.  This module is based
+  on the lower-level module `pypystm`_, but also provides some
+  compatibility with non-STM PyPy or CPython.
 
 * Building on top of the way the GIL is removed, we will talk
   about `Atomic sections, Transactions, etc.: a better way to write
@@ -63,9 +62,10 @@
 
 Development is done in the branch `stmgc-c7`_.  If you are only
 interested in trying it out, you can download an Ubuntu binary here__
-(``pypy-stm-2.3*.tar.bz2``, Ubuntu 12.04-14.04).  The current version
+(``pypy-stm-2.*.tar.bz2``, for Ubuntu 12.04-14.04).  The current version
 supports four "segments", which means that it will run up to four
-threads in parallel.
+threads in parallel.  (Development recently switched to `stmgc-c8`_,
+but that is not ready for trying out yet.)
 
 To build a version from sources, you first need to compile a custom
 version of clang(!); we recommend downloading `llvm and clang like
@@ -78,6 +78,7 @@
    rpython/bin/rpython -Ojit --stm pypy/goal/targetpypystandalone.py
 
 .. _`stmgc-c7`: https://bitbucket.org/pypy/pypy/src/stmgc-c7/
+.. _`stmgc-c8`: https://bitbucket.org/pypy/pypy/src/stmgc-c8/
 .. __: https://bitbucket.org/pypy/pypy/downloads/
 .. __: http://clang.llvm.org/get_started.html
 .. __: https://bitbucket.org/pypy/stmgc/src/default/c7/llvmfix/
@@ -85,11 +86,11 @@
 
 .. _caveats:
 
-Current status
---------------
+Current status (stmgc-c7)
+-------------------------
 
-* So far, small examples work fine, but there are still a few bugs.
-  We're busy fixing them as we find them; feel free to `report bugs`_.
+* It seems to work fine, without crashing any more.  Please `report
+  any crash`_ you find (or other bugs).
 
 * It runs with an overhead as low as 20% on examples like "richards".
   There are also other examples with higher overheads --currently up to
@@ -97,8 +98,9 @@
   One suspect is our partial GC implementation, see below.
 
 * Currently limited to 1.5 GB of RAM (this is just a parameter in
-  `core.h`__).  Memory overflows are not correctly handled; they cause
-  segfaults.
+  `core.h`__ -- theoretically.  In practice, increasing it too much
+  makes clang crash again).  Memory overflows are not correctly handled;
+  they cause segfaults.
 
 * The JIT warm-up time improved recently but is still bad.  In order to
   produce machine code, the JIT needs to enter a special single-threaded
@@ -114,11 +116,9 @@
   numbers of small objects that don't immediately die (surely a common
   situation) suffer from these missing optimizations.
 
-* The GC has no support for destructors: the ``__del__`` method is never
-  called (including on file objects, which won't be closed for you).
-  This is of course temporary.  Also, weakrefs might appear to work a
-  bit strangely for now (staying alive even though ``gc.collect()``, or
-  even dying but then un-dying for a short time before dying again).
+* Weakrefs might appear to work a bit strangely for now, sometimes
+  staying alive through ``gc.collect()``, or even dying but then
+  un-dying for a short time before dying again.
 
 * The STM system is based on very efficient read/write barriers, which
   are mostly done (their placement could be improved a bit in
@@ -130,7 +130,7 @@
 
 * Very long-running processes (on the order of days) will eventually
   crash on an assertion error because of a non-implemented overflow of
-  an internal 29-bit number.
+  an internal 28-bit counter.
 
-.. _`report bugs`: https://bugs.pypy.org/
+.. _`report any crash`: https://bugs.pypy.org/
 .. __: https://bitbucket.org/pypy/pypy/raw/stmgc-c7/rpython/translator/stm/src_stm/stm/core.h
@@ -175,29 +175,105 @@
 
 This works by internally considering the points where a standard PyPy or
 CPython would release the GIL, and replacing them with the boundaries of
-"transaction".  Like their database equivalent, multiple transactions
+"transactions".  Like their database equivalent, multiple transactions
 can execute in parallel, but will commit in some serial order.  They
 appear to behave as if they were completely run in this serialization
 order.
 
 
+A better way to write parallel programs
+---------------------------------------
+
+In CPU-hungry programs, we can often easily identify outermost loops
+over some data structure, or other repetitive algorithms, where each
+"block" consists of processing a non-trivial amount of data, and where
+the blocks "have a good chance" of being independent from each other.
+We don't need to prove that they are actually independent: it is
+enough if they are *often independent* --- or, more precisely, if we
+*think they should be* often independent.
+
+One typical example would look like this, where the function ``func()``
+usually invokes a large amount of code::
+
+    for key, value in bigdict.items():
+        func(key, value)
+
+Then you simply replace the loop with::
+
+    from transaction import TransactionQueue
+
+    tr = TransactionQueue()
+    for key, value in bigdict.items():
+        tr.add(func, key, value)
+    tr.run()
+
+This code behaves equivalently to the plain loop.  Internally, the
+``TransactionQueue`` object will start N threads and try to run the
+``func(key, value)`` calls on all threads in parallel.  But note the
+difference with a regular thread-pooling library, as found in many
+languages lower-level than Python: the function calls are not randomly
+interleaved with each other just because they run in parallel.  The
+behavior does not change just because we use ``TransactionQueue``.
+All the calls still *appear* to execute in some serial order.
+
+Now the performance should ideally be improved: if the function calls
+turn out to be actually independent (most of the time), then it will
+be.  But if they are not, the total performance will fall back to
+that of a single thread, with some small additional penalty for the
+overhead.
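+
+As a hypothetical sketch of the worst case: if every call to
+``func()`` writes to the same shared object, the transactions all
+conflict with each other and execution is effectively serial again::
+
+    from transaction import TransactionQueue
+
+    counter = [0]
+
+    def func(key, value):
+        counter[0] += 1   # every call writes the same object: conflicts
+
+    tr = TransactionQueue()
+    for key, value in bigdict.items():
+        tr.add(func, key, value)
+    tr.run()   # result is correct, but with little or no parallelism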
+
+You can detect this case when the total CPU usage remains low (closer
+to 1 than N cores).  Note that the CPU usage is not expected to go
+much higher than 1 in the JIT warm-up
+phase.  You must run a program for several seconds, or for larger
+programs at least one minute, to give the JIT a chance to warm up
+correctly.  But if CPU usage remains low even though all code is
+executing in a ``TransactionQueue.run()``, then the ``PYPYSTM``
+environment variable can be used to track what is going on.
+
+Run your program with ``PYPYSTM=stmlog`` to produce a log file called
+``stmlog``.  Afterwards, use the ``pypy/stm/print_stm_log.py`` utility
+to inspect the content of this log file.  It produces output like
+this::
+
+    documentation in progress!
+
+
+
 Atomic sections
 ---------------
 
-PyPy supports *atomic sections,* which are blocks of code which you want
-to execute without "releasing the GIL".  *This is experimental and may
-be removed in the future.*  In STM terms, this means blocks of code that
-are executed while guaranteeing that the transaction is not interrupted
-in the middle.
+PyPy supports *atomic sections,* which are blocks of code which you
+want to execute without "releasing the GIL".  In STM terms, this means
+blocks of code that are executed while guaranteeing that the
+transaction is not interrupted in the middle.  *This is experimental
+and may be removed in the future* if `lock elision`_ is ever
+implemented.
 
 Here is a usage example::
 
-    with __pypy__.thread.atomic:
+    with transaction.atomic:
         assert len(lst1) == 10
         x = lst1.pop(0)
         lst1.append(x)
 
-In this (bad) example, we are sure that the item popped off one end of
+In this example, we are sure that the item popped off one end of
 the list is appended again at the other end atomically.  It means that
 another thread can run ``len(lst1)`` or ``x in lst1`` without any
 particular synchronization, and always see the same results,
@@ -225,21 +285,31 @@
 manually a transaction break just before the atomic block.  This is
 because the boundaries of the block are not guaranteed to be the
 boundaries of the transaction: the latter is at least as big as the
-block, but maybe bigger.  Therefore, if you run a big atomic block, it
+block, but may be bigger.  Therefore, if you run a big atomic block, it
 is a good idea to break the transaction just before.  This can be done
-e.g. by the hack of calling ``time.sleep(0)``.  (This may be fixed at
+by calling ``transaction.hint_commit_soon()``.  (This may be fixed at
 some point.)
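+
+As a hypothetical sketch (assuming the same ``transaction`` module as
+above)::
+
+    import transaction
+
+    transaction.hint_commit_soon()   # break the current transaction here
+    with transaction.atomic:
+        ...   # the big atomic block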
 
-There are also issues with the interaction of locks and atomic blocks.
-This can be seen if you write to files (which have locks), including
-with a ``print`` to standard output.  If one thread tries to acquire a
-lock while running in an atomic block, and another thread has got the
-same lock, then the former may fail with a ``thread.error``.  The reason
-is that "waiting" for some condition to become true --while running in
-an atomic block-- does not really make sense.  For now you can work
-around it by making sure that, say, all your prints are either in an
-``atomic`` block or none of them are.  (This kind of issue is
-theoretically hard to solve.)
+There are also issues with the interaction of regular locks and atomic
+blocks.  This can be seen if you write to files (which have locks),
+including with a ``print`` to standard output.  If one thread tries to
+acquire a lock while running in an atomic block, and another thread
+already holds the same lock, then the former may fail with a
+``thread.error``.  The reason is that "waiting" for some condition to
+become true --while running in an atomic block-- does not really make
+sense.  For now you can work around it by making sure that, say, all
+your prints are either in an ``atomic`` block or none of them are.
+(This kind of issue is theoretically hard to solve and may be the
+reason why atomic block support is eventually removed.)
 
 
 Locks
diff --git a/pypy/module/cpyext/api.py b/pypy/module/cpyext/api.py
--- a/pypy/module/cpyext/api.py
+++ b/pypy/module/cpyext/api.py
@@ -192,7 +192,7 @@
 
 class ApiFunction:
     def __init__(self, argtypes, restype, callable, error=_NOT_SPECIFIED,
-                 c_name=None):
+                 c_name=None, gil=None):
         self.argtypes = argtypes
         self.restype = restype
         self.functype = lltype.Ptr(lltype.FuncType(argtypes, restype))
@@ -208,6 +208,7 @@
         assert argnames[0] == 'space'
         self.argnames = argnames[1:]
         assert len(self.argnames) == len(self.argtypes)
+        self.gil = gil
 
     def _freeze_(self):
         return True
@@ -223,14 +224,15 @@
     def get_wrapper(self, space):
         wrapper = getattr(self, '_wrapper', None)
         if wrapper is None:
-            wrapper = make_wrapper(space, self.callable)
+            wrapper = make_wrapper(space, self.callable, self.gil)
             self._wrapper = wrapper
             wrapper.relax_sig_check = True
             if self.c_name is not None:
                 wrapper.c_name = cpyext_namespace.uniquename(self.c_name)
         return wrapper
 
-def cpython_api(argtypes, restype, error=_NOT_SPECIFIED, external=True):
+def cpython_api(argtypes, restype, error=_NOT_SPECIFIED, external=True,
+                gil=None):
     """
     Declares a function to be exported.
     - `argtypes`, `restype` are lltypes and describe the function signature.
@@ -240,6 +242,13 @@
       SystemError.
     - set `external` to False to get a C function pointer, but not exported by
       the API headers.
+    - set `gil` to "acquire", "release" or "around" to acquire the GIL,
+      release the GIL, or do both around the call, respectively.
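+      For example (mirroring the real uses in pystate.py below)::
+
+          @cpython_api([PyThreadState], lltype.Void, gil="acquire")
+          def PyEval_RestoreThread(space, tstate):
+              ...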
     """
     if isinstance(restype, lltype.Typedef):
         real_restype = restype.OF
@@ -262,7 +266,8 @@
             c_name = None
         else:
             c_name = func_name
-        api_function = ApiFunction(argtypes, restype, func, error, c_name=c_name)
+        api_function = ApiFunction(argtypes, restype, func, error,
+                                   c_name=c_name, gil=gil)
         func.api_func = api_function
 
         if external:
@@ -594,12 +599,16 @@
 pypy_debug_catch_fatal_exception = rffi.llexternal('pypy_debug_catch_fatal_exception', [], lltype.Void)
 
 # Make the wrapper for the cases (1) and (2)
-def make_wrapper(space, callable):
+def make_wrapper(space, callable, gil=None):
     "NOT_RPYTHON"
     names = callable.api_func.argnames
     argtypes_enum_ui = unrolling_iterable(enumerate(zip(callable.api_func.argtypes,
         [name.startswith("w_") for name in names])))
     fatal_value = callable.api_func.restype._defl()
+    gil_acquire = (gil == "acquire" or gil == "around")
+    gil_release = (gil == "release" or gil == "around")
+    assert gil is None or gil_acquire or gil_release
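+    # i.e. `gil` must be None or one of "acquire", "release", "around"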
 
     @specialize.ll()
     def wrapper(*args):
@@ -607,6 +615,12 @@
         from pypy.module.cpyext.pyobject import Reference
         # we hope that malloc removal removes the newtuple() that is
         # inserted exactly here by the varargs specializer
+        if gil_acquire:
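+            # entering Python code: the 'after external call' hook
+            # re-acquires the GIL, if the hooks are set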
+            after = rffi.aroundstate.after
+            if after:
+                after()
         rffi.stackcounter.stacks_counter += 1
         llop.gc_stack_bottom(lltype.Void)   # marker for trackgcroot.py
         retval = fatal_value
@@ -678,6 +690,12 @@
                 print str(e)
                 pypy_debug_catch_fatal_exception()
         rffi.stackcounter.stacks_counter -= 1
+        if gil_release:
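+            # leaving Python code: the 'before external call' hook
+            # releases the GIL again, if the hooks are set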
+            before = rffi.aroundstate.before
+            if before:
+                before()
         return retval
     callable._always_inline_ = 'try'
     wrapper.__name__ = "wrapper for %r" % (callable, )
diff --git a/pypy/module/cpyext/pystate.py b/pypy/module/cpyext/pystate.py
--- a/pypy/module/cpyext/pystate.py
+++ b/pypy/module/cpyext/pystate.py
@@ -19,7 +19,7 @@
 class NoThreads(Exception):
     pass
 
-@cpython_api([], PyThreadState, error=CANNOT_FAIL)
+@cpython_api([], PyThreadState, error=CANNOT_FAIL, gil="release")
 def PyEval_SaveThread(space):
     """Release the global interpreter lock (if it has been created and thread
     support is enabled) and reset the thread state to NULL, returning the
@@ -29,19 +29,15 @@
     state = space.fromcache(InterpreterState)
     tstate = state.swap_thread_state(
         space, lltype.nullptr(PyThreadState.TO))
-    if rffi.aroundstate.before:
-        rffi.aroundstate.before()
     return tstate
 
-@cpython_api([PyThreadState], lltype.Void)
+@cpython_api([PyThreadState], lltype.Void, gil="acquire")
 def PyEval_RestoreThread(space, tstate):
     """Acquire the global interpreter lock (if it has been created and thread
     support is enabled) and set the thread state to tstate, which must not be
     NULL.  If the lock has been created, the current thread must not have
     acquired it, otherwise deadlock ensues.  (This function is available even
     when thread support is disabled at compile time.)"""
-    if rffi.aroundstate.after:
-        rffi.aroundstate.after()
     state = space.fromcache(InterpreterState)
     state.swap_thread_state(space, tstate)
 
@@ -182,17 +178,14 @@
     state = space.fromcache(InterpreterState)
     return state.swap_thread_state(space, tstate)
 
-@cpython_api([PyThreadState], lltype.Void)
+@cpython_api([PyThreadState], lltype.Void, gil="acquire")
 def PyEval_AcquireThread(space, tstate):
     """Acquire the global interpreter lock and set the current thread state to
     tstate, which should not be NULL.  The lock must have been created earlier.
     If this thread already has the lock, deadlock ensues.  This function is not
     available when thread support is disabled at compile time."""
-    if rffi.aroundstate.after:
-        # After external call is before entering Python
-        rffi.aroundstate.after()
 
-@cpython_api([PyThreadState], lltype.Void)
+@cpython_api([PyThreadState], lltype.Void, gil="release")
 def PyEval_ReleaseThread(space, tstate):
     """Reset the current thread state to NULL and release the global interpreter
     lock.  The lock must have been created earlier and must be held by the current
@@ -200,28 +193,21 @@
     that it represents the current thread state --- if it isn't, a fatal error is
     reported. This function is not available when thread support is disabled at
     compile time."""
-    if rffi.aroundstate.before:
-        # Before external call is after running Python
-        rffi.aroundstate.before()
 
 PyGILState_STATE = rffi.INT
 
-@cpython_api([], PyGILState_STATE, error=CANNOT_FAIL)
+@cpython_api([], PyGILState_STATE, error=CANNOT_FAIL, gil="acquire")
 def PyGILState_Ensure(space):
     # XXX XXX XXX THIS IS A VERY MINIMAL IMPLEMENTATION THAT WILL HAPPILY
     # DEADLOCK IF CALLED TWICE ON THE SAME THREAD, OR CRASH IF CALLED IN A
     # NEW THREAD.  We should very carefully follow what CPython does instead.
-    if rffi.aroundstate.after:
-        # After external call is before entering Python
-        rffi.aroundstate.after()
     return rffi.cast(PyGILState_STATE, 0)
 
-@cpython_api([PyGILState_STATE], lltype.Void)
+@cpython_api([PyGILState_STATE], lltype.Void, gil="release")
 def PyGILState_Release(space, state):
     # XXX XXX XXX We should very carefully follow what CPython does instead.
-    if rffi.aroundstate.before:
-        # Before external call is after running Python
-        rffi.aroundstate.before()
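+    # nothing left to do here: the gil="release" decorator releases the GIL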
+    pass
 
 @cpython_api([], PyInterpreterState, error=CANNOT_FAIL)
 def PyInterpreterState_Head(space):
@@ -236,7 +221,8 @@
     """
     return lltype.nullptr(PyInterpreterState.TO)
 
-@cpython_api([PyInterpreterState], PyThreadState, error=CANNOT_FAIL)
+@cpython_api([PyInterpreterState], PyThreadState, error=CANNOT_FAIL,
+             gil="around")
 def PyThreadState_New(space, interp):
     """Create a new thread state object belonging to the given interpreter
     object.  The global interpreter lock need not be held, but may be held if
@@ -245,12 +231,9 @@
         raise NoThreads
     # PyThreadState_Get will allocate a new execution context,
     # we need to protect gc and other globals with the GIL.
-    rffi.aroundstate.after()
-    try:
-        rthread.gc_thread_start()
-        return PyThreadState_Get(space)
-    finally:
-        rffi.aroundstate.before()
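+    # the gil="around" decorator acquires/releases the GIL around this call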
+    rthread.gc_thread_start()
+    return PyThreadState_Get(space)
 
 @cpython_api([PyThreadState], lltype.Void)
 def PyThreadState_Clear(space, tstate):
diff --git a/pypy/module/cpyext/test/test_translate.py b/pypy/module/cpyext/test/test_translate.py
--- a/pypy/module/cpyext/test/test_translate.py
+++ b/pypy/module/cpyext/test/test_translate.py
@@ -11,7 +11,7 @@
     FT = lltype.FuncType([], lltype.Signed)
     FTPTR = lltype.Ptr(FT)
 
-    def make_wrapper(space, func):
+    def make_wrapper(space, func, gil=None):
         def wrapper():
             return func(space)
         return wrapper

