[pypy-commit] stmgc default: Uh, sorry, 'design.txt' is the one I meant to commit...
arigo
noreply at buildbot.pypy.org
Sun Sep 28 16:52:26 CEST 2014
Author: Armin Rigo <arigo at tunes.org>
Branch:
Changeset: r1434:57c2c6d8f2ef
Date: 2014-09-28 16:52 +0200
http://bitbucket.org/pypy/stmgc/changeset/57c2c6d8f2ef/
Log: Uh, sorry, 'design.txt' is the one I meant to commit...
diff --git a/hashtable/design.txt b/hashtable/design.txt
new file mode 100644
--- /dev/null
+++ b/hashtable/design.txt
@@ -0,0 +1,72 @@
+Goal
+======
+
+The goal is to have dictionaries where a read-write or write-write
+conflict does not cause aborts if they occur with keys that have
+different 64-bit hashes.
+
+(We might prefer the condition to be "on different keys even in case of
+hash collision", but that's hard to achieve in general: for Python
+dicts, "to be equal" involves calling the __eq__() method of objects.)
+
+We distinguish between reading a value associated to a key, and only
+checking that the key is in the dictionary. It makes a difference if
+a concurrent transaction writes to the value.
+
+Some operations on a dict (particularly len() and __nonzero__()) involve
+the global state of the dict, and thus cause conflict with any write
+that adds or removes keys. They should not cause conflicts with reads,
+or with writes that only change existing values.
+
+In theory, we might refine this to: a len() only conflicts with a
+different transaction whose net effect is to change the length (adding 3
+keys and removing 3 other keys is fine); and a __nonzero__() only
+conflicts with a transaction whose net effect is to change the dict from
+empty to non-empty or vice-versa.
+
+Iterating over the keys of a dict doesn't have to conflict with other
+transactions that only change existing values. Iterating over the
+values or the items conflict with other transactions doing any write at
+all.
+
+
+Model
+=======
+
+We can use the following idea to give a theoretical model of the
+above:
+
+Let H = {0, ... 2**64-1} be the set of possible hashes. A dictionary is
+an array of length 2**64, where each item contains a "line" of zero or
+more key/value pairs. We have STM read and write markers as follows:
+
+* for every key/value pair, we have two markers (a read and a write) on
+ the "value";
+
+* for every line (i.e. for every possible hash value), we also have two
+ markers (a read and a write) on the line itself.
+
+Then:
+
+* Reading or writing the value associated with an existing key accesses
+ the read marker of the line, and the read or write marker of that
+ particular value.
+
+* Checking for the presence of a key only accesses the read marker of
+ the line.
+
+* Creating a new key accesses the write marker of the line (the write
+ marker of the newly added value is not relevant then, because other
+ transactions won't be able to access the line anyway).
+
+* Deleting a key also accesses the write marker of the line. (We cannot
+ do it by pretending the write the value NULL, so accessing only the
+ write marker of the value, because then it wouldn't conflict with
+ another transaction that checks for the presence of the key by
+ accessing only the read marker of the line.)
+
+* Global operations, like getting the list of keys, work by mass-marking
+ all the lines in H (all 2**64 of them, so obviously it needs to be
+ special-cased in the implementation). More precisely, len(), keys(),
+ etc., sets all the lines' read markers; clear() sets all the lines'
+ write markers.
diff --git a/hashtable/design2.txt b/hashtable/design2.txt
deleted file mode 100644
--- a/hashtable/design2.txt
+++ /dev/null
@@ -1,124 +0,0 @@
-Goal
-======
-
-The goal is to have dictionaries where a read-write or write-write
-conflict does not cause aborts if they occur with keys that have
-different 64-bit hashes.
-
-(We might prefer the condition to be "on different keys even in case of
-hash collision", but that's hard to achieve in general: for Python
-dicts, "to be equal" involves calling the __eq__() method of objects.)
-
-We distinguish between reading a value associated to a key, and only
-checking that the key is in the dictionary. It makes a difference if
-a concurrent transaction writes to the value.
-
-Some operations on a dict (particularly len() and __nonzero__()) involve
-the global state of the dict, and thus cause conflict with any write
-that adds or removes keys. They should not cause conflicts with reads,
-or with writes that only change existing values.
-
-In theory, we might refine this to: a len() only conflicts with a
-different transaction whose net effect is to change the length (adding 3
-keys and removing 3 other keys is fine); and a __nonzero__() only
-conflicts with a transaction whose net effect is to change the dict from
-empty to non-empty or vice-versa. The latter is probably more important
-than the former, so we'll ignore the former.
-
-Iterating over the keys of a dict doesn't have to conflict with other
-transactions that only change existing values. Iterating over the
-values or the items conflict with other transactions doing any write at
-all.
-
-
-Idea
-======
-
-A dict is implemented used two distinct parts: the committed part,
-and the uncommitted one. Each part is optimized differently.
-
-
-Committed part
---------------
-
-The committed part uses separate chaining with linked lists. It is an
-array of pointers of length some power of two. From the hash, we access
-item (hash & (power_of_two - 1)). We get a pointer to some Entry
-object, with fields "hash", "key", "value", and "next". The "hash"
-field stored in the Entry objects is the full 64-bit hash. The "next"
-field might point to more Entry objects.
-
-This whole structure is only modified during commit, by special code not
-subject to the normal STM rules. There is only one writer, the
-transaction currently trying to commit; but we need to be careful so that
-concurrent reads work as expected.
-
-For the sequel, the committed part works theoretically like an array of
-length 2**64, indexed by the hash, where each item contains zero of more
-Entry objects with that hash value.
-
-
-Uncommitted part
-----------------
-
-For the uncommitted part we can use a hash table similar to the one used
-for RPython dicts, with open addressing. We import data from the
-committed part to this uncommitted part when needed (at the granularity
-of a 64-bit hash value). More precisely, the uncommitted part can be in
-one of these states:
-
-* It can be a freshly created dictionary, with no committed part yet.
- That's the easy case: the uncommitted hash table is all we need.
-
-* Or, we have a committed part, and we have imported from it
- zero or more 64-bit hash values. We need to remember which ones.
- That includes the imports that yielded zero key/value pairs. For each
- imported hash value, we make (zero or more) entries in the uncommitted
- part where we copy the key, but where the value is initially missing.
- The value is copied lazily, with another lookup that will mark the
- Entry object as "read" in the usual STM sense.
-
-* We may have additionally imported the "emptiness" or "non-emptiness"
- of the committed part.
-
-* Last option: the current transaction is depending on the exact set
- of committed keys. We no longer need to remember which ones
- individually. This state is equivalent to having imported *all*
- possible 64-bit hash values.
-
-
-Commit time
------------
-
-At commit time, we need to do these extra steps. The points followed by
-(*) need to be done carefully because concurrent threads might be
-reading the same data.
-
-* First, we do the usual STM validation. It will detect read-write
- and write-write conflicts on existing values thanks to the read and
- write markers xxx
-
-* We validate the keys: for every imported hash value, we check that
- importing it now would give us the same answer as it did previously
- (i.e. the committed table has got the same set of keys with this
- particular hash as it did previously).
-
-* For key/value pairs that have been newly added by the current
- transaction, the validation above is enough too: to add a key/value
- pair, we must have imported all Entries with the same hash anyway.
- So at this point we only need to create and attach(*) new Entry
- objects for new key/value pairs.
-
-
-xxxxxxxxxxx
-
-
-
-* First, we create new Entry objects for all key/value pairs that
- are created by the current transaction.
-
-* First, we create or update the Entry objects for all key/value
- pairs that have been modified by the current transaction. We
- store the new ones by carefully changing the array of pointers.
-
-*
More information about the pypy-commit
mailing list