[pypy-commit] pypy default: Finish refactoring this document

arigo noreply at buildbot.pypy.org
Mon Mar 16 10:51:24 CET 2015


Author: Armin Rigo <arigo at tunes.org>
Branch: 
Changeset: r76396:c892964d02b7
Date: 2015-03-16 10:51 +0100
http://bitbucket.org/pypy/pypy/changeset/c892964d02b7/

Log:	Finish refactoring this document

diff --git a/pypy/doc/stm.rst b/pypy/doc/stm.rst
--- a/pypy/doc/stm.rst
+++ b/pypy/doc/stm.rst
@@ -46,13 +46,14 @@
   multiple cores.
 
 * ``pypy-stm`` provides (but does not impose) a special API to the
-  user in the pure Python module `transaction`_.  This module is based
-  on the lower-level module `pypystm`_, but also provides some
+  user in the pure Python module ``transaction``.  This module is based
+  on the lower-level module ``pypystm``, but also provides some
   compatibility with non-STM PyPy or CPython.
 
 * Building on top of the way the GIL is removed, we will talk
-  about `Atomic sections, Transactions, etc.: a better way to write
-  parallel programs`_.
+  about `How to write multithreaded programs: the 10'000-feet view`_
+  and `transaction.TransactionQueue`_.
+
 
 
 Getting Started
@@ -89,7 +90,7 @@
 Current status (stmgc-c7)
 -------------------------
 
-* It seems to work fine, without crashing any more.  Please `report
+* **NEW:** It seems to work fine, without crashing any more.  Please `report
   any crash`_ you find (or other bugs).
 
 * It runs with an overhead as low as 20% on examples like "richards".
@@ -97,24 +98,35 @@
   2x for "translate.py" -- which we are still trying to understand.
   One suspect is our partial GC implementation, see below.
 
+* **NEW:** the ``PYPYSTM`` environment variable and the
+  ``pypy/stm/print_stm_log.py`` script let you know exactly which
+  "conflicts" occurred.  This is described in the section
+  `transaction.TransactionQueue`_ below.
+
+* **NEW:** special transaction-friendly APIs (like ``stmdict``),
+  described in the section `transaction.TransactionQueue`_ below.  The
+  old API changed again, mostly moving to different modules.  Sorry
+  about that.  I feel it's a better idea to change the API early
+  instead of being stuck with a bad one later...
+
 * Currently limited to 1.5 GB of RAM (this is just a parameter in
   `core.h`__ -- theoretically.  In practice, increase it too much and
   clang crashes again).  Memory overflows are not correctly handled;
   they cause segfaults.
 
-* The JIT warm-up time improved recently but is still bad.  In order to
-  produce machine code, the JIT needs to enter a special single-threaded
-  mode for now.  This means that you will get bad performance results if
-  your program doesn't run for several seconds, where *several* can mean
-  *many.*  When trying benchmarks, be sure to check that you have
-  reached the warmed state, i.e. the performance is not improving any
-  more.  This should be clear from the fact that as long as it's
-  producing more machine code, ``pypy-stm`` will run on a single core.
+* **NEW:** The JIT warm-up time improved again, but is still
+  relatively large.  In order to produce machine code, the JIT needs
+  to enter "inevitable" mode.  This means that you will get bad
+  performance results if your program doesn't run for several seconds,
+  where *several* can mean *many.* When trying benchmarks, be sure to
+  check that you have reached the warmed state, i.e. the performance
+  is not improving any more.
 
 * The GC is new; although clearly inspired by PyPy's regular GC, it
   misses a number of optimizations for now.  Programs allocating large
   numbers of small objects that don't immediately die (surely a common
-  situation) suffer from these missing optimizations.
+  situation) suffer from these missing optimizations.  (The bleeding
+  edge ``stmgc-c8`` is better at that.)
 
 * Weakrefs might appear to work a bit strangely for now, sometimes
   staying alive through ``gc.collect()``, or even dying but then
@@ -122,8 +134,7 @@
 
 * The STM system is based on very efficient read/write barriers, which
   are mostly done (their placement could be improved a bit in
-  JIT-generated machine code).  But the overall bookkeeping logic could
-  see more improvements (see `Low-level statistics`_ below).
+  JIT-generated machine code).
 
 * Forking the process is slow because the complete memory needs to be
   copied manually.  A warning is printed to this effect.
@@ -132,7 +143,8 @@
   crash on an assertion error because of a non-implemented overflow of
   an internal 28-bit counter.
 
-.. _`report bugs`: https://bugs.pypy.org/
+
+.. _`report any crash`: https://bitbucket.org/pypy/pypy/issues?status=new&status=open
 .. __: https://bitbucket.org/pypy/pypy/raw/stmgc-c7/rpython/translator/stm/src_stm/stm/core.h
 
 
@@ -155,7 +167,6 @@
 interpreter and other ones might have slightly different needs.
 
 
-
 User Guide
 ==========
 
@@ -181,8 +192,33 @@
 order.
 
 
-A better way to write parallel programs
----------------------------------------
+How to write multithreaded programs: the 10'000-feet view
+---------------------------------------------------------
+
+PyPy-STM offers two ways to write multithreaded programs:
+
+* the traditional way, using the ``thread`` or ``threading`` modules.
+
+* using ``TransactionQueue``, described next__, as a way to hide the
+  low-level notion of threads.
+
+.. __: `transaction.TransactionQueue`_
+
+``TransactionQueue`` hides the hard multithreading-related issues that
+we typically encounter when using low-level threads.  This is not the
+first alternative approach to avoid dealing with low-level threads;
+for example, OpenMP_ is one.  However, it is one of the first ones
+which does not require the code to be organized in a particular
+fashion.  Instead, it works on any Python program which has got
+*latent* and *imperfect* parallelism.  Ideally, it only requires that
+the end programmer identifies where this parallelism is likely to be
+found, and communicates it to the system using a simple API.
+
+.. _OpenMP: http://en.wikipedia.org/wiki/OpenMP
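
The pattern described above can be sketched in a few lines.  (This is
only a hedged illustration: on ``pypy-stm`` the real
``transaction.TransactionQueue`` runs the queued calls in parallel; the
serial stand-in class below is an assumption added so the sketch runs
on any Python.)

```python
# Sketch of the TransactionQueue pattern described above.  On pypy-stm,
# transaction.TransactionQueue tentatively runs the queued calls in
# parallel while preserving serial semantics; elsewhere we fall back to
# a minimal serial stand-in (illustration only) with the same surface.
try:
    from transaction import TransactionQueue
except ImportError:
    class TransactionQueue(object):
        """Serial stand-in with the same add()/run() surface."""
        def __init__(self):
            self._pending = []
        def add(self, f, *args, **kwds):
            self._pending.append((f, args, kwds))
        def run(self):
            while self._pending:
                f, args, kwds = self._pending.pop(0)
                f(*args, **kwds)

results = {}
def f(key, value):
    results[key] = value * 2     # each call works on its own key

bigdict = {"a": 1, "b": 2, "c": 3}
tq = TransactionQueue()
for key, value in bigdict.items():
    tq.add(f, key, value)        # record the latent parallelism
tq.run()                         # execute; serially equivalent either way
```

Note that the loop body itself is unchanged: only the driving loop is
rewritten, which is what "does not require the code to be organized in
a particular fashion" means in practice.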
+
+
+transaction.TransactionQueue
+----------------------------
 
 In CPU-hungry programs, we can often easily identify outermost loops
 over some data structure, or other repetitive algorithm, where each
@@ -322,7 +358,7 @@
 "releasing the GIL".  In STM terms, this means blocks of code that are
 executed while guaranteeing that the transaction is not interrupted in
 the middle.  *This is experimental and may be removed in the future*
-if `lock elision`_ is ever implemented.
+if `Software lock elision`_ is ever implemented.
 
 Here is a direct usage example::
 
@@ -369,7 +405,8 @@
 including with a ``print`` to standard output.  If one thread tries to
 acquire a lock while running in an atomic block, and another thread
 has got the same lock at that point, then the former may fail with a
-``thread.error``.  The reason is that "waiting" for some condition to
+``thread.error``.  (Don't rely on it; it may also deadlock.)
+The reason is that "waiting" for some condition to
 become true --while running in an atomic block-- does not really make
 sense.  For now you can work around it by making sure that, say, all
 your prints are either in an ``atomic`` block or none of them are.
@@ -428,106 +465,38 @@
 .. _`software lock elision`: https://www.repository.cam.ac.uk/handle/1810/239410
 
 
-Atomic sections, Transactions, etc.: a better way to write parallel programs
-----------------------------------------------------------------------------
+Miscellaneous functions
+-----------------------
 
-(This section is based on locks as we plan to implement them, but also
-works with the existing atomic sections.)
-
-In the cases where elision works, the block of code can run in parallel
-with other blocks of code *even if they are protected by the same lock.*
-You still get the illusion that the blocks are run sequentially.  This
-works even for multiple threads that run each a series of such blocks
-and nothing else, protected by one single global lock.  This is
-basically the Python application-level equivalent of what was done with
-the interpreter in ``pypy-stm``: while you think you are writing
-thread-unfriendly code because of this global lock, actually the
-underlying system is able to make it run on multiple cores anyway.
-
-This capability can be hidden in a library or in the framework you use;
-the end user's code does not need to be explicitly aware of using
-threads.  For a simple example of this, there is `transaction.py`_ in
-``lib_pypy``.  The idea is that you write, or already have, some program
-where the function ``f(key, value)`` runs on every item of some big
-dictionary, say::
-
-    for key, value in bigdict.items():
-        f(key, value)
-
-Then you simply replace the loop with::
-
-    for key, value in bigdict.items():
-        transaction.add(f, key, value)
-    transaction.run()
-
-This code runs the various calls to ``f(key, value)`` using a thread
-pool, but every single call is executed under the protection of a unique
-lock.  The end result is that the behavior is exactly equivalent --- in
-fact it makes little sense to do it in this way on a non-STM PyPy or on
-CPython.  But on ``pypy-stm``, the various locked calls to ``f(key,
-value)`` can tentatively be executed in parallel, even if the observable
-result is as if they were executed in some serial order.
-
-This approach hides the notion of threads from the end programmer,
-including all the hard multithreading-related issues.  This is not the
-first alternative approach to explicit threads; for example, OpenMP_ is
-one.  However, it is one of the first ones which does not require the
-code to be organized in a particular fashion.  Instead, it works on any
-Python program which has got latent, imperfect parallelism.  Ideally, it
-only requires that the end programmer identifies where this parallelism
-is likely to be found, and communicates it to the system, using for
-example the ``transaction.add()`` scheme.
-
-.. _`transaction.py`: https://bitbucket.org/pypy/pypy/raw/stmgc-c7/lib_pypy/transaction.py
-.. _OpenMP: http://en.wikipedia.org/wiki/OpenMP
-
-
-.. _`transactional_memory`:
-
-API of transactional_memory
----------------------------
-
-The new pure Python module ``transactional_memory`` runs on both CPython
-and PyPy, both with and without STM.  It contains:
-
-* ``getsegmentlimit()``: return the number of "segments" in
+* ``transaction.getsegmentlimit()``: return the number of "segments" in
   this pypy-stm.  This is the limit above which more threads will not be
   able to execute on more cores.  (Right now it is limited to 4 due to
   inter-segment overhead, but should be increased in the future.  It
   should also be settable, and the default value should depend on the
   number of actual CPUs.)  If STM is not available, this returns 1.
 
-* ``print_abort_info(minimum_time=0.0)``: debugging help.  Each thread
-  remembers the longest abort or pause it did because of cross-thread
-  contention_.  This function prints it to ``stderr`` if the time lost
-  is greater than ``minimum_time`` seconds.  The record is then
-  cleared, to make it ready for new events.  This function returns
-  ``True`` if it printed a report, and ``False`` otherwise.
+* ``__pypy__.thread.signals_enabled``: a context manager that runs its
+  block of code with signals enabled.  By default, signals are only
+  enabled in the main thread; a non-main thread will not receive
+  signals (this is like CPython).  Enabling signals in non-main
+  threads is useful for libraries where threads are hidden and the end
+  user does not expect their code to run anywhere but the main
+  thread.
 
+* ``pypystm.exclusive_atomic``: a context manager similar to
+  ``transaction.atomic`` but which complains if it is nested.
 
-API of __pypy__.thread
-----------------------
+* ``transaction.is_atomic()``: return True if called from an atomic
+  context.
 
-The ``__pypy__.thread`` submodule is a built-in module of PyPy that
-contains a few internal built-in functions used by the
-``transactional_memory`` module, plus the following:
+* ``pypystm.count()``: return a different positive integer every time
+  it is called.  This works without generating conflicts.  The
+  returned integers are only roughly in increasing order; this should
+  not be relied upon.
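
As a small illustration of ``pypystm.count()`` as a conflict-free ID
generator (hedged: the ``itertools.count`` fallback is an assumption
for non-STM Pythons; it matches the "fresh positive integer each call"
contract but has none of the conflict-free property):

```python
# pypystm.count() used as an ID generator that, under STM, works
# without generating conflicts.  On a non-STM Python we substitute
# itertools.count (illustration only): same contract of returning a
# different positive integer on each call.
import itertools
try:
    from pypystm import count as next_id
except ImportError:
    _fallback = itertools.count(1)
    def next_id():
        return next(_fallback)

ids = [next_id() for _ in range(5)]
assert len(set(ids)) == 5        # five calls, five distinct integers
assert all(i > 0 for i in ids)   # all positive
```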
 
-* ``__pypy__.thread.atomic``: a context manager to run a block in
-  fully atomic mode, without "releasing the GIL".  (May be eventually
-  removed?)
 
-* ``__pypy__.thread.signals_enabled``: a context manager that runs its
-  block with signals enabled.  By default, signals are only enabled in
-  the main thread; a non-main thread will not receive signals (this is
-  like CPython).  Enabling signals in non-main threads is useful for
-  libraries where threads are hidden and the end user is not expecting
-  his code to run elsewhere than in the main thread.
-
-
-.. _contention:
-
-Conflicts
----------
+More details about conflicts
+----------------------------
 
 Based on Software Transactional Memory, the ``pypy-stm`` solution is
 prone to "conflicts".  To repeat the basic idea, threads execute their code
@@ -543,7 +512,7 @@
 the transaction).  If this occurs too often, parallelization fails.
 
 How much actual parallelization a multithreaded program can see is a bit
-subtle.  Basically, a program not using ``__pypy__.thread.atomic`` or
+subtle.  Basically, a program not using ``transaction.atomic`` or
 eliding locks, or doing so for very short amounts of time, will
 parallelize almost freely (as long as it's not some artificial example
 where, say, all threads try to increase the same global counter and do
@@ -555,13 +524,14 @@
 overview.
 
 Parallelization works as long as two principles are respected.  The
-first one is that the transactions must not *conflict* with each other.
-The most obvious sources of conflicts are threads that all increment a
-global shared counter, or that all store the result of their
-computations into the same list --- or, more subtly, that all ``pop()``
-the work to do from the same list, because that is also a mutation of
-the list.  (It is expected that some STM-aware library will eventually
-be designed to help with conflict problems, like a STM-aware queue.)
+first one is that the transactions must not *conflict* with each
+other.  The most obvious sources of conflicts are threads that all
+increment a global shared counter, or that all store the result of
+their computations into the same list --- or, more subtly, that all
+``pop()`` the work to do from the same list, because that is also a
+mutation of the list.  (You can work around it with
+``transaction.stmdict``, but for that specific example, some STM-aware
+queue should eventually be designed.)
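
The ``transaction.stmdict`` workaround mentioned above can be sketched
as follows (hedged: the plain-``dict`` fallback is an assumption so the
sketch runs on any Python; only on ``pypy-stm`` does ``stmdict``
actually help avoid conflicts):

```python
# Sketch of the workaround described above: instead of all transactions
# appending to one shared list (a guaranteed write/write conflict),
# each one writes to its own key of a transaction-friendly mapping.
# On non-STM Pythons we substitute a plain dict (assumption: same
# mapping API, no STM benefit).
try:
    from transaction import stmdict
except ImportError:
    stmdict = dict

results = stmdict()

def worker(key, value):
    # disjoint keys -> writers do not touch each other's entries
    results[key] = value * value

for k, v in [("a", 2), ("b", 3), ("c", 4)]:
    worker(k, v)
```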
 
 A conflict occurs as follows: when a transaction commits (i.e. finishes
 successfully) it may cause other transactions that are still in progress
@@ -577,22 +547,23 @@
 Another issue is that of avoiding long-running so-called "inevitable"
 transactions ("inevitable" is taken in the sense of "which cannot be
 avoided", i.e. transactions which cannot abort any more).  Transactions
-like that should only occur if you use ``__pypy__.thread.atomic``,
-generally become of I/O in atomic blocks.  They work, but the
+like that should only occur if you use ``atomic``,
+generally because of I/O in atomic blocks.  They work, but the
 transaction is turned inevitable before the I/O is performed.  For all
 the remaining execution time of the atomic block, they will impede
 parallel work.  The best is to organize the code so that such operations
-are done completely outside ``__pypy__.thread.atomic``.
+are done completely outside ``atomic``.
 
-(This is related to the fact that blocking I/O operations are
+(This is not unrelated to the fact that blocking I/O operations are
 discouraged with Twisted, and if you really need them, you should do
 them on their own separate thread.)
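
The advice above -- do the computation inside ``atomic``, the I/O
outside -- can be sketched as follows (hedged: the no-op fallback
context manager is an assumption so the sketch runs without STM, and
the list append stands in for real I/O):

```python
# Keep I/O out of the atomic block, as advised above: the computation
# stays inside the transaction (where it can still abort and retry),
# while the output happens after the block ends, so the transaction is
# not turned inevitable partway through.
import contextlib
try:
    from transaction import atomic      # real atomic context on pypy-stm
except ImportError:
    atomic = contextlib.nullcontext()   # no-op stand-in (assumption)

log = []

def process(items):
    with atomic:                   # pure computation only
        total = sum(items)
    log.append("total=%d" % total) # the "I/O", outside the block

process([1, 2, 3])
```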
 
-In case of lock elision, we don't get long-running inevitable
-transactions, but a different problem can occur: doing I/O cancels lock
-elision, and the lock turns into a real lock, preventing other threads
-from committing if they also need this lock.  (More about it when lock
-elision is implemented and tested.)
+In case lock elision eventually replaces atomic sections, we wouldn't
+get long-running inevitable transactions, but the same problem occurs
+in a different way: doing I/O cancels lock elision, and the lock turns
+into a real lock.  This prevents other threads from committing if they
+also need this lock.  (More about it when lock elision is implemented
+and tested.)
 
 
 
@@ -602,56 +573,18 @@
 XXX this section mostly empty for now
 
 
-Low-level statistics
---------------------
-
-When a non-main thread finishes, you get low-level statistics printed to
-stderr, looking like that::
-
-      thread 0x7f73377fe600:
-          outside transaction          42182    0.506 s
-          run current                  85466    0.000 s
-          run committed                34262    3.178 s
-          run aborted write write       6982    0.083 s
-          run aborted write read         550    0.005 s
-          run aborted inevitable         388    0.010 s
-          run aborted other                0    0.000 s
-          wait free segment                0    0.000 s
-          wait write read                 78    0.027 s
-          wait inevitable                887    0.490 s
-          wait other                       0    0.000 s
-          sync commit soon                 1    0.000 s
-          bookkeeping                  51418    0.606 s
-          minor gc                    162970    1.135 s
-          major gc                         1    0.019 s
-          sync pause                   59173    1.738 s
-          longest recordered marker          0.000826 s
-          "File "x.py", line 5, in f"
-
-On each line, the first number is a counter, and the second number gives
-the associated time --- the amount of real time that the thread was in
-this state.  The sum of all the times should be equal to the total time
-between the thread's start and the thread's end.  The most important
-points are "run committed", which gives the amount of useful work, and
-"outside transaction", which should give the time spent e.g. in library
-calls (right now it seems to be larger than that; to investigate).  The
-various "run aborted" and "wait" entries are time lost due to
-conflicts_.  Everything else is overhead of various forms.  (Short-,
-medium- and long-term future work involves reducing this overhead :-)
-
-The last two lines are special; they are an internal marker read by
-``transactional_memory.print_abort_info()``.
-
-
 Reference to implementation details
 -----------------------------------
 
-The core of the implementation is in a separate C library called stmgc_,
-in the c7_ subdirectory.  Please see the `README.txt`_ for more
-information.  In particular, the notion of segment is discussed there.
+The core of the implementation is in a separate C library called
+stmgc_, in the c7_ subdirectory (current version of pypy-stm) and in
+the c8_ subdirectory (bleeding edge version).  Please see the
+`README.txt`_ for more information.  In particular, the notion of
+segment is discussed there.
 
 .. _stmgc: https://bitbucket.org/pypy/stmgc/src/default/
 .. _c7: https://bitbucket.org/pypy/stmgc/src/default/c7/
+.. _c8: https://bitbucket.org/pypy/stmgc/src/default/c8/
 .. _`README.txt`: https://bitbucket.org/pypy/stmgc/raw/default/c7/README.txt
 
 PyPy itself adds on top of it the automatic placement of read__ and write__

