[pypy-commit] extradoc extradoc: Tweaks tweaks tweaks

arigo noreply at buildbot.pypy.org
Mon May 5 16:11:16 CEST 2014


Author: Armin Rigo <arigo at tunes.org>
Branch: extradoc
Changeset: r5239:fbc8c4259336
Date: 2014-05-05 16:11 +0200
http://bitbucket.org/pypy/extradoc/changeset/fbc8c4259336/

Log:	Tweaks tweaks tweaks

diff --git a/talk/icooolps2014/position-paper.pdf b/talk/icooolps2014/position-paper.pdf
index 8a9a6a4ac4f35959a4eec538ef2bc6d20ef057b1..d0e11f70c50dc0b0e812b86a8ade78182e267737
GIT binary patch

[cut]

diff --git a/talk/icooolps2014/position-paper.tex b/talk/icooolps2014/position-paper.tex
--- a/talk/icooolps2014/position-paper.tex
+++ b/talk/icooolps2014/position-paper.tex
@@ -72,7 +72,8 @@
   fine-grained locking, shared-nothing, and transactional memory (TM)
   approaches. We argue that software-based TM systems are the most
   promising, especially since they also enable the introduction of
-  atomic blocks as a better synchronisation mechanism in the language.
+  large, parallelisable atomic blocks as a better synchronisation
+  mechanism in the language.
 \end{abstract}
 
 %\category{CR-number}{subcategory}{third-level}
@@ -111,7 +112,7 @@
 The approach that wins in the end should perform similarly for
 single-threaded execution as compared to the GIL and be able to
 execute code in parallel on multiple cores. Furthermore, we will also
-take into account the compatibility to existing code that already uses
+take into account the compatibility with existing code that may already use
 threads for concurrency, as well as the changes that are required to
 the interpreter itself.
 
@@ -119,7 +120,7 @@
 overall winner. While it has a big performance problem currently, it
 gets more points in all the other categories. We think that it is the
 only solution that also provides a better synchronisation mechanism to
-the application in the form of paralleliseable atomic blocks.
+the application in the form of parallelisable atomic blocks.
 
 %% \subsection{Issue}
 %% The issue that we want to discuss is how to efficiently support
@@ -157,7 +158,7 @@
     \hline
     Existing applications         & ++   & ++ & -{-} & ++ & ++   \\
     \hline
-    Better synchronisation        & o    & o  & o    & o  & ++   \\
+    Better synchronisation        & o    & o  &      & o  & ++   \\
     \hline
     Implementation                & ++   & -  & ++   & ++ & ++   \\
     \hline
@@ -265,11 +266,13 @@
 acquire-release operations.
 
 Jython~\cite{webjython} is one project that implements an
-interpreter for Python on the JVM\footnote{Java Virtual Machine} and
-that uses fine-grained locking to correctly synchronise the
+interpreter for Python on the Java Virtual Machine (JVM) and
+that uses fine-grained locking\footnote{The performance impact of
+fine-grained locking is milder in Java than it would be in a typical piece
+of C code; see e.g.~\cite{biased}.} to correctly synchronise the
 interpreter. For a language like Python, one needs quite a few,
 carefully placed locks. Since there is no central location, the
-complexity of the implementation is quite a bit greater compared to
+complexity of the implementation is quite a bit larger compared to
 using a GIL. Integrating external, non-thread-safe libraries should
 however be very simple too. One could simply use one lock per library
 to avoid this issue.
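(As an aside on the "one lock per library" idea above: a minimal Python sketch of what such a wrapper could look like. The names `safe_call` and `_lib_lock` are invented for illustration and do not appear in the paper; the point is only that every call into a non-thread-safe library is serialised through a single lock, trading parallelism for correctness.)

```python
import threading

_lib_lock = threading.Lock()  # hypothetical: one lock guarding a whole library

def safe_call(func, *args, **kwargs):
    # Serialise every call into the non-thread-safe library.
    # Correctness is easy to argue; the cost is that no two
    # threads can ever be inside the library at the same time.
    with _lib_lock:
        return func(*args, **kwargs)
```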
@@ -296,25 +299,25 @@
 to replace it. If an application can be split into completely
 independent parts that only very rarely need to share anything, or
 only do so via an external program like a database, then it is
-sensible to have one GIL per independent part. As an example of such
-an approach we look at the
+sensible to have one GIL per independent part. At the extreme, there
+are applications that parallelise perfectly, simply by running
+independent processes; some web servers and some numeric computations
+do. We will consider here a slightly more general approach: the
 \emph{multiprocessing}\footnote{https://docs.python.org/2/library/multiprocessing.html}
 module of Python. In essence, it uses process-forking to provide the
 application with multiple interpreters that can run in parallel.
 Communication is then done explicitly through pipes.
 
-Obviously not every application fits well into this model and its
-applicability is therefore quite limited. Performance is good as
+The model of explicit communication is sometimes seen as a superior
+way to synchronise concurrent applications because of its explicitness.
+However, not every application fits well into this model and its
+applicability is therefore limited. Performance is good as
 long as the application does not need to communicate a lot, because
 inter-process communication is relatively expensive. Also the
 implementation of this approach is very cheap since one can
 actually take an unmodified GIL-supported interpreter and run
-multiple of them in parallel. That way, we also inherit the
+several of them in parallel. That way, we also inherit the
 easy integration of external libraries without any changes.
-While the model of explicit communication is often seen as a
-superior way to synchronise concurrent applications because
-of its explicitness, it does not actually introduce a better
-synchronisation mechanism for applications.
 
 %% - often needs major restructuring of programs (explicit data exchange)\\
 %% - sometimes communication overhead is too large\\
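(To make the \emph{multiprocessing} paragraph above concrete, a minimal sketch using the standard-library `Process` and `Pipe` API: each child is a forked interpreter with its own GIL, and all communication is explicit through the pipe. The `worker` function name is invented for illustration.)

```python
from multiprocessing import Pipe, Process

def worker(conn, n):
    # Runs in its own forked interpreter process, with its own GIL,
    # so it executes genuinely in parallel with the parent.
    conn.send(sum(range(n)))
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end, 10))
    p.start()
    result = parent_end.recv()  # explicit communication through the pipe
    p.join()
    print(result)  # prints 45
```

Note how nothing is shared implicitly: the cost of this model is exactly the inter-process communication the paper mentions.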
@@ -361,9 +364,9 @@
 library that needs to be integrated and synchronised for use in
 multiple threads. The one thing that is missing is support for a
 better synchronisation mechanism for the application. It is not
-possible in general to expose the hardware-transactions to the
-application in the form of atomic blocks, because that would require
-much longer transactions.
+reasonable in general to expose the hardware transactions to the
+application in the form of atomic blocks, because doing so would
+require the system to support much longer transactions.
 
 %% - false-sharing on cache-line level\\
 %% - limited capacity (caches, undocumented)\\
@@ -378,7 +381,7 @@
 no benefits at all for low numbers of threads (1-8). There are some
 attempts ~\cite{warmhoff13,spear09} that can reduce the overhead a lot,
 but scale badly or only for certain workloads. Often the benefits
-on more than one thread are too little in real world applications.
+on more than one thread are too small in real-world applications.
 
 However, STM compared to HTM does not suffer from the same restricting
 limitations. Transactions can in principle be arbitrarily long.  This makes it
@@ -391,12 +394,13 @@
 parallel programming forward. Together with sequential consistency it
 provides a lot of simplification for parallel applications.
 
-While one can argue that STM requires the insertion of read and write
-barriers in the whole program, this can be done automatically and
+On the implementation level,
+while one can argue that STM requires the insertion of read and write
+barriers in the whole interpreter, this can be done automatically and
 locally by a program transformation~\cite{felber07}. There are attempts
 to do the same for fine-grained locking~\cite{bill06} but they require
-a whole program analysis since locks are inherently non-composable.
-The effectiveness of these approaches is doubtful in our use case,
+a whole program analysis since locks are inherently non-composable
+--- and their effectiveness is doubtful in our use case,
 since we execute bytecode instructions in any order defined by a
 script only known at runtime. This makes it close to impossible to
 order locks consistently or to know in advance which locks a
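(The non-composability of locks discussed in the hunk above can be sketched in a few lines of Python. The function and variable names are invented for illustration: each function is correct in isolation, but they acquire the same two locks in opposite orders, so running them concurrently can deadlock; avoiding this requires the kind of whole-program lock-order analysis that is impossible when the bytecode execution order is only known at runtime.)

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def transfer_ab(accounts):
    # Correct on its own: acquires lock_a, then lock_b.
    with lock_a:
        with lock_b:
            accounts["a"] -= 1
            accounts["b"] += 1

def transfer_ba(accounts):
    # Also correct on its own, but acquires the same locks in the
    # opposite order: interleaved with transfer_ab, this can deadlock.
    # A consistent global lock order would fix it, but requires
    # whole-program knowledge that an interpreter does not have.
    with lock_b:
        with lock_a:
            accounts["b"] -= 1
            accounts["a"] += 1
```

An atomic block, by contrast, composes: nesting two atomic operations simply yields a larger atomic operation, with no ordering obligation on the programmer.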
@@ -438,7 +442,7 @@
 tailor the system to the discussed use case which gives us an
 advantage over other STM systems that are more general. With this
 approach, initial results suggest that we can keep the overhead of STM
-below 50\%. A hybrid TM system, which also uses HTM to accelerate
+well below 50\%. A hybrid TM system, which also uses HTM to accelerate
 certain tasks, looks like a very promising direction of research
 too. We believe that further work to reduce the overhead of STM is
 very worthwhile.
@@ -558,6 +562,13 @@
   Hardware Transactional Memory in Main-Memory Databases."
   \emph{Proc. of ICDE}. 2014.
 
+\bibitem{biased}
+  Kenneth Russell and David Detlefs. 2006. Eliminating
+  synchronization-related atomic operations with biased locking and
+  bulk rebiasing. \emph{In Proceedings of the 21st annual ACM SIGPLAN
+    conference on Object-oriented programming systems, languages, and
+    applications} (OOPSLA '06).
+
 \end{thebibliography}
 
 

