[pypy-commit] stmgc default: Uh, sorry, 'design.txt' is the one I meant to commit...

Sun Sep 28 16:52:26 CEST 2014

Author: Armin Rigo <arigo at tunes.org>
Branch: 
Changeset: r1434:57c2c6d8f2ef
Date: 2014-09-28 16:52 +0200
http://bitbucket.org/pypy/stmgc/changeset/57c2c6d8f2ef/

Log:	Uh, sorry, 'design.txt' is the one I meant to commit...

diff --git a/hashtable/design.txt b/hashtable/design.txt
new file mode 100644
--- /dev/null
+++ b/hashtable/design.txt
@@ -0,0 +1,72 @@
+Goal
+======
+
+The goal is to have dictionaries where a read-write or write-write
+conflict does not cause aborts if they occur with keys that have
+different 64-bit hashes.
+
+(We might prefer the condition to be "on different keys even in case of
+hash collision", but that's hard to achieve in general: for Python
+dicts, "to be equal" involves calling the __eq__() method of objects.)
+
+We distinguish between reading a value associated with a key, and only
+checking that the key is in the dictionary.  It makes a difference if
+a concurrent transaction writes to the value.
+
+Some operations on a dict (particularly len() and __nonzero__()) involve
+the global state of the dict, and thus cause conflict with any write
+that adds or removes keys.  They should not cause conflicts with reads,
+or with writes that only change existing values.
+
+In theory, we might refine this to: a len() only conflicts with a
+different transaction whose net effect is to change the length (adding 3
+keys and removing 3 other keys is fine); and a __nonzero__() only
+conflicts with a transaction whose net effect is to change the dict from
+empty to non-empty or vice-versa.
+
+Iterating over the keys of a dict doesn't have to conflict with other
+transactions that only change existing values.  Iterating over the
+values or the items conflicts with other transactions doing any write at
+all.
+
+
+Model
+=======
+
+We can use the following idea to give a theoretical model of the
+above:
+
+Let H = {0, ..., 2**64-1} be the set of possible hashes.  A dictionary is
+an array of length 2**64, where each item contains a "line" of zero or
+more key/value pairs.  We have STM read and write markers as follows:
+
+* for every key/value pair, we have two markers (a read and a write) on
+  the "value";
+
+* for every line (i.e. for every possible hash value), we also have two
+  markers (a read and a write) on the line itself.
+
+Then:
+
+* Reading or writing the value associated with an existing key accesses
+  the read marker of the line, and the read or write marker of that
+  particular value.
+
+* Checking for the presence of a key only accesses the read marker of
+  the line.
+
+* Creating a new key accesses the write marker of the line (the write
+  marker of the newly added value is not relevant then, because other
+  transactions won't be able to access the line anyway).
+
+* Deleting a key also accesses the write marker of the line.  (We cannot
+  do it by pretending to write the value NULL, so accessing only the
+  write marker of the value, because then it wouldn't conflict with
+  another transaction that checks for the presence of the key by
+  accessing only the read marker of the line.)
+
+* Global operations, like getting the list of keys, work by mass-marking
+  all the lines in H (all 2**64 of them, so obviously it needs to be
+  special-cased in the implementation).  More precisely, len(), keys(),
+  etc., set all the lines' read markers; clear() sets all the lines'
+  write markers.
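
The marker rules of the "Model" section in design.txt can be tried out with
a small Python sketch.  This is only an illustration, not code from stmgc:
the names Transaction, hash64, conflicts and all method names are
hypothetical, Python's built-in hash() masked to 64 bits stands in for the
real hash function, and the two "all lines" flags are one possible way to
special-case the mass-marking of all 2**64 lines.

```python
MASK64 = 2**64 - 1


def hash64(key):
    """Line index in H (assumption: Python's hash(), masked to 64 bits)."""
    return hash(key) & MASK64


class Transaction:
    """Hypothetical record of the markers one transaction has touched."""

    def __init__(self):
        self.line_reads = set()      # hashes of lines with the read marker set
        self.line_writes = set()     # hashes of lines with the write marker set
        self.value_reads = set()     # (hash, key) pairs: value read marker
        self.value_writes = set()    # (hash, key) pairs: value write marker
        self.all_lines_read = False     # special case for len(), keys(), ...
        self.all_lines_written = False  # special case for clear()

    def read_value(self, key):       # read the value of an existing key
        h = hash64(key)
        self.line_reads.add(h)
        self.value_reads.add((h, key))

    def write_existing(self, key):   # overwrite the value of an existing key
        h = hash64(key)
        self.line_reads.add(h)
        self.value_writes.add((h, key))

    def contains(self, key):         # only check for the presence of a key
        self.line_reads.add(hash64(key))

    def create(self, key):           # add a new key
        self.line_writes.add(hash64(key))

    def delete(self, key):           # remove a key
        self.line_writes.add(hash64(key))

    def global_read(self):           # len(), keys(), etc.
        self.all_lines_read = True

    def clear(self):
        self.all_lines_written = True


def _writes_hit(w, o):
    """True if some write marker of `w` overlaps a read or write marker of `o`."""
    if w.value_writes & (o.value_reads | o.value_writes):
        return True
    if w.line_writes & (o.line_reads | o.line_writes):
        return True
    if w.line_writes and (o.all_lines_read or o.all_lines_written):
        return True
    o_touches_a_line = bool(o.line_reads or o.line_writes
                            or o.all_lines_read or o.all_lines_written)
    return w.all_lines_written and o_touches_a_line


def conflicts(a, b):
    """True if transactions `a` and `b` cannot both commit under the model."""
    return _writes_hit(a, b) or _writes_hit(b, a)
```

With this sketch, a transaction that overwrites the value of key 1 does not
conflict with one that only checks for the presence of key 1, whereas a
transaction that deletes key 1 does, because deletion goes through the
write marker of the line.
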
diff --git a/hashtable/design2.txt b/hashtable/design2.txt
deleted file mode 100644
--- a/hashtable/design2.txt
+++ /dev/null
@@ -1,124 +0,0 @@
-Goal
-======
-
-The goal is to have dictionaries where a read-write or write-write
-conflict does not cause aborts if they occur with keys that have
-different 64-bit hashes.
-
-(We might prefer the condition to be "on different keys even in case of
-hash collision", but that's hard to achieve in general: for Python
-dicts, "to be equal" involves calling the __eq__() method of objects.)
-
-We distinguish between reading a value associated to a key, and only
-checking that the key is in the dictionary.  It makes a difference if
-a concurrent transaction writes to the value.
-
-Some operations on a dict (particularly len() and __nonzero__()) involve
-the global state of the dict, and thus cause conflict with any write
-that adds or removes keys.  They should not cause conflicts with reads,
-or with writes that only change existing values.
-
-In theory, we might refine this to: a len() only conflicts with a
-different transaction whose net effect is to change the length (adding 3
-keys and removing 3 other keys is fine); and a __nonzero__() only
-conflicts with a transaction whose net effect is to change the dict from
-empty to non-empty or vice-versa.  The latter is probably more important
-than the former, so we'll ignore the former.
-
-Iterating over the keys of a dict doesn't have to conflict with other
-transactions that only change existing values.  Iterating over the
-values or the items conflict with other transactions doing any write at
-all.
-
-
-Idea
-======
-
-A dict is implemented used two distinct parts: the committed part,
-and the uncommitted one.  Each part is optimized differently.
-
-
-Committed part
---------------
-
-The committed part uses separate chaining with linked lists.  It is an
-array of pointers of length some power of two.  From the hash, we access
-item (hash & (power_of_two - 1)).  We get a pointer to some Entry
-object, with fields "hash", "key", "value", and "next".  The "hash"
-field stored in the Entry objects is the full 64-bit hash.  The "next"
-field might point to more Entry objects.
-
-This whole structure is only modified during commit, by special code not
-subject to the normal STM rules.  There is only one writer, the
-transaction currently trying to commit; but we need to be careful so that
-concurrent reads work as expected.
-
-For the sequel, the committed part works theoretically like an array of
-length 2**64, indexed by the hash, where each item contains zero of more
-Entry objects with that hash value.
-
-
-Uncommitted part
-----------------
-
-For the uncommitted part we can use a hash table similar to the one used
-for RPython dicts, with open addressing.  We import data from the
-committed part to this uncommitted part when needed (at the granularity
-of a 64-bit hash value).  More precisely, the uncommitted part can be in
-one of these states:
-
-* It can be a freshly created dictionary, with no committed part yet.
-  That's the easy case: the uncommitted hash table is all we need.
-
-* Or, we have a committed part, and we have imported from it
-  zero or more 64-bit hash values.  We need to remember which ones.
-  That includes the imports that yielded zero key/value pairs.  For each
-  imported hash value, we make (zero or more) entries in the uncommitted
-  part where we copy the key, but where the value is initially missing.
-  The value is copied lazily, with another lookup that will mark the
-  Entry object as "read" in the usual STM sense.
-
-* We may have additionally imported the "emptiness" or "non-emptiness"
-  of the committed part.
-
-* Last option: the current transaction is depending on the exact set
-  of committed keys.  We no longer need to remember which ones
-  individually.  This state is equivalent to having imported *all*
-  possible 64-bit hash values.
-
-
-Commit time
------------
-
-At commit time, we need to do these extra steps.  The points followed by
-(*) need to be done carefully because concurrent threads might be
-reading the same data.
-
-* First, we do the usual STM validation.  It will detect read-write
-  and write-write conflicts on existing values thanks to the read and
-  write markers xxx
-
-* We validate the keys: for every imported hash value, we check that
-  importing it now would give us the same answer as it did previously
-  (i.e. the committed table has got the same set of keys with this
-  particular hash as it did previously).
-
-* For key/value pairs that have been newly added by the current
-  transaction, the validation above is enough too: to add a key/value
-  pair, we must have imported all Entries with the same hash anyway.
-  So at this point we only need to create and attach(*) new Entry
-  objects for new key/value pairs.
-
-
-xxxxxxxxxxx
-
-
-
-* First, we create new Entry objects for all key/value pairs that
-  are created by the current transaction.
-
-* First, we create or update the Entry objects for all key/value
-  pairs that have been modified by the current transaction.  We
-  store the new ones by carefully changing the array of pointers.
-
-*
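
For reference, the "Committed part" of the removed design2.txt (separate
chaining, bucket index hash & (power_of_two - 1), full 64-bit hash stored
in each Entry) can be sketched as follows.  This is a single-threaded
illustration only: CommittedTable, keys_with_hash and attach are
hypothetical names, and the care needed for concurrent readers is not
modelled.

```python
class Entry:
    """One key/value pair of the committed part."""

    def __init__(self, hash, key, value, next=None):
        self.hash = hash      # full 64-bit hash, not just the bucket index
        self.key = key
        self.value = value
        self.next = next      # next Entry in the same bucket, or None


class CommittedTable:
    """Array of Entry chains whose length is a power of two (hypothetical)."""

    def __init__(self, size=8):
        assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
        self.buckets = [None] * size

    def _index(self, hash):
        return hash & (len(self.buckets) - 1)

    def keys_with_hash(self, hash):
        """The keys that importing this 64-bit hash value would yield."""
        keys = []
        entry = self.buckets[self._index(hash)]
        while entry is not None:
            if entry.hash == hash:    # other hashes can share the bucket
                keys.append(entry.key)
            entry = entry.next
        return keys

    def attach(self, hash, key, value):
        """Link a new Entry at the head of its bucket's chain."""
        index = self._index(hash)
        # The Entry is fully built before the bucket pointer is replaced.
        self.buckets[index] = Entry(hash, key, value, self.buckets[index])
```

In this sketch, validating an imported hash value at commit time would
amount to checking that keys_with_hash() still returns the same set of keys
as it did at import time.
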