[Persistence-sig] A simple Observation API

Phillip J. Eby pje@telecommunity.com
Tue, 30 Jul 2002 15:05:39 -0400


At 02:40 PM 7/30/02 -0400, Shane Hathaway wrote:
>On Tue, 30 Jul 2002, Phillip J. Eby wrote:
>>
>> This has to do with the "write-through mode" phase between
>> "prepareToCommit()" and "voteOnCommit()" messages (whatever you call them).
>>  During this phase, to support cascaded storage (one data manager writes to
>> another), all data managers must "write through" any changes that occur
>> *immediately*.  They can't wait for "prepareToCommit()", because they've
>> already received it.  Basically, when the object says, "I've changed"
>> (i.e. via "register" or "notify" or whatever you call it), the data manager
>> must write it out right then.
>
>I'm having trouble understanding this.  Is prepareToCommit() the first
>phase, and voteOnCommit() the second phase?  Can't the data manager commit
>the data on the second phase?

They're messages, not phases.  The phase is the period between messages.

Let's say we have DM1, DM2, and DM3, and the transaction calls:

DM2.prepare()
DM3.prepare()
DM1.prepare()

DM2.vote()
DM3.vote()
DM1.vote()

If DM1 writes to DM3, and DM3 writes to DM2, then this ordering doesn't
work, unless you have a "write-through" phase between prepare() and vote().
 That is, if DM3 goes into "write-through" mode when it receives prepare(),
then it will write through to DM2 when DM1 writes to it during the
DM1.prepare() method.
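
To make the write-through idea concrete, here's a minimal sketch in
Python.  All the names (record_change(), prepare(), vote(), etc.) are
invented for illustration; they don't match any real transaction API:

```python
class DataManager:
    """Toy transaction participant; everything here is a sketch,
    not the real ZODB data manager interface."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream   # the DM this one writes to
        self.pending = []              # changes buffered before prepare()
        self.written = []              # changes actually written out
        self.write_through = False

    def record_change(self, change):
        """Called when an object registers a change with this DM."""
        if self.write_through:
            # prepare() has already run, so the change must be written
            # through *immediately*; the downstream DM has to see it
            # before anyone calls vote().
            self._write(change)
        else:
            self.pending.append(change)

    def prepare(self):
        # Flush everything buffered so far, then enter write-through
        # mode for any changes that arrive later (e.g. changes caused
        # by another DM's own prepare()).
        for change in self.pending:
            self._write(change)
        self.pending = []
        self.write_through = True

    def vote(self):
        # By vote() time, every change (including late write-throughs)
        # has already been pushed downstream.
        return "ok"

    def _write(self, change):
        self.written.append(change)
        if self.downstream is not None:
            self.downstream.record_change(change)


# The ordering from the example above: DM1 writes to DM3, DM3 writes
# to DM2, and the transaction happens to prepare them in order 2, 3, 1.
dm2 = DataManager("DM2")
dm3 = DataManager("DM3", downstream=dm2)
dm1 = DataManager("DM1", downstream=dm3)

dm2.prepare()
dm3.prepare()
dm1.record_change("obj-A")   # a change DM1 has buffered
dm1.prepare()                # flushes to DM3, which writes through to DM2
for dm in (dm2, dm3, dm1):
    dm.vote()
```

Because DM3 and DM2 are already in write-through mode when DM1's
prepare() flushes, the change reaches DM2 before any vote() is sent.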


>> But, if the _p_changed flag is set *before* the change, the data manager
>> has no way to know what the change was and write it.  It can't wait for
>> "voteOnCommit()", because then the DM it writes to might have already
>> voted, for example.  It *must* know about the change as soon as the change
>> has occurred.  Thus, the change message must *follow* a change.  It's okay
>> if there are multiple change messages, as long as there's at least one
>> *after* a set of changes.
>
>For ZODB 3 I've realized that sometimes application code needs to set
>_p_changed *before* making a change.  Here is an example of potentially
>broken code:
>
>def addDate(self, date):
>    self.dates.append(date)  # self.dates is a simple list
>    self.dates.sort()
>    self._p_changed = 1
>
>Let's say self.dates.sort() raises some exception that leads to an aborted
>transaction.  Objects are supposed to be reverted on transaction abort,
>but that won't happen here!  The connection was never notified that there
>were changes, so self.dates is now out of sync.  But if the application
>sets _p_changed just *before* the change, aborting will work.

Good point.  I hadn't really thought about that use case.  But the
Observation API I proposed does support it, via separate
beforeChange()/afterChange() notifications.  A DM could track
beforeChange() to know that an object needs rolling back, and
afterChange() to actually send a change through to an underlying DB, if
it's in write-through mode at the time.
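
Here's a rough sketch of how that might look.  Only the
beforeChange()/afterChange() names come from the proposal; the rest
(the snapshot mechanism, DateHolder, commit()/abort()) is invented
for illustration:

```python
import copy

class DataManager:
    """Toy DM; beforeChange()/afterChange() follow the proposed
    notifications, everything else is a sketch."""

    def __init__(self):
        self.snapshots = {}        # id(obj) -> (obj, saved __dict__)
        self.write_through = False
        self.written = []

    def beforeChange(self, obj):
        # First notification for an object in this transaction:
        # snapshot its state so abort() can revert it, even if the
        # change itself then blows up halfway through.
        if id(obj) not in self.snapshots:
            self.snapshots[id(obj)] = (obj, copy.deepcopy(obj.__dict__))

    def afterChange(self, obj):
        # The change has now happened; in write-through mode, this is
        # where it would be pushed to an underlying DB immediately.
        if self.write_through:
            self.written.append(copy.deepcopy(obj.__dict__))

    def commit(self):
        self.snapshots.clear()

    def abort(self):
        for obj, state in self.snapshots.values():
            obj.__dict__.clear()
            obj.__dict__.update(copy.deepcopy(state))
        self.snapshots.clear()


class DateHolder:
    def __init__(self):
        self.dates = []

    def addDate(self, dm, date):
        dm.beforeChange(self)        # DM hears a change is coming...
        try:
            self.dates.append(date)
            self.dates.sort()        # ...so a failure here is recoverable
        finally:
            dm.afterChange(self)     # ...and hears again once it happened


dm = DataManager()
holder = DateHolder()
holder.addDate(dm, "2002-07-30")
dm.commit()

try:
    holder.addDate(dm, 42)           # mixing types makes sort() raise
except TypeError:
    dm.abort()                       # the snapshot lets us roll back

# holder.dates is back to ["2002-07-30"], not left half-modified
```

Because the snapshot happens in beforeChange(), Shane's broken-sort
scenario is handled: the object is reverted on abort even though the
exception struck mid-change.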


>> Now, you may say that there are other ways to address dependencies between
>> participants than having "write-through mode" during the prepare->vote
>> phase.  And you're right.  ZPatterns certainly manages to work around this,
>> as does Steve Alexander's TransactionAgents.  TransactionAgents, however,
>> is actually a partial rewrite of the Zope transaction machinery, and there
>> are some holes in how ZPatterns addresses the issue as well.  (ZPatterns
>> addresses it by adding more objects to the transaction during the
>> "commit()" calls to the data managers, that are roughly equivalent to the
>> current "prepare()" message concept.)
>>
>> We could address this by having transaction participants declare their
>> dependencies to other participants, and have the transaction do a
>> topological sort, and send all messages in dependency order.  It could then
>> be an error to have a circular dependency, and data managers could raise an
>> error if they received an object change message once they were done with
>> the prepare() call.  It would make the Transaction API and implementation a
>> bit more complex, leave data managers about the same in complexity as they
>> would have been before, and it would mean that persistent objects wouldn't
>> need to worry about whether _p_changed was flagged before or after a
>> change.
>
>Are you alluding to "indexing agents" and "rule agents" like we talked
>about before?  

That's what TransactionAgents does, but that's not what I'm looking for per
se.  I'm looking at simple data managers.  For example, if I make a data
manager that persists a set of objects to an XML DOM, I might want to use
it with a DOM persistence manager that stores XML documents in an SQL
database.  All three "data managers" (persist->XML, XML->Database, SQL
database) are transaction participants, with implied or actual ordering.
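
In toy form (every class here invented, no real DOM or SQL machinery),
the layering looks like this; note that each layer knows only the one
directly below it, which is what makes assigning global priority
numbers awkward:

```python
class SQLDatabase:
    """Bottom layer: ultimately receives the rows."""
    def __init__(self):
        self.rows = []
    def store(self, table, value):
        self.rows.append((table, value))

class DOMStorageManager:
    """Middle layer: persists XML documents into the SQL database."""
    def __init__(self, db):
        self.db = db
    def save_document(self, name, xml):
        self.db.store("documents", (name, xml))

class ObjectToXMLManager:
    """Top layer: persists plain objects as XML via the DOM manager."""
    def __init__(self, dom_manager):
        self.dom = dom_manager
    def save(self, name, obj):
        xml = "<obj>%s</obj>" % obj    # stand-in for real serialization
        self.dom.save_document(name, xml)


db = SQLDatabase()
top = ObjectToXMLManager(DOMStorageManager(db))
top.save("doc1", "hello")
```

At commit time all three layers participate in the same transaction,
and each one's writes must reach the layer below before that layer
finishes preparing.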


>I think we do need some kind of transaction participant
>ordering to support those concepts.  I had in mind a simple numerical
>prioritization scheme.  Is the need complex enough to require topological
>sorting?

Numerical prioritization requires that you have global knowledge of the
participants, and therefore seems to go against modular usage of
components, such as in my example above.

Certainly, any non-circular topological relationship can be reduced to a
numerical ordering.  After all, Python new-style classes do it in __mro__.
A topological sort using the kjbuckets module is maybe 30-40 lines of
Python code, however; not much to pay, IMHO, for the debugging it saves
people who would otherwise be tearing their hair out trying to figure
out why something fails intermittently: they gave two items the same
numerical priority, so sometimes one of them goes first and sometimes
the other does.
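
For the record, here's roughly what such a sort looks like in plain
Python (Kahn's algorithm; no kjbuckets, and the function name is
invented).  Each participant maps to the participants it writes to,
which must therefore receive each message after it:

```python
from collections import deque

def commit_order(deps):
    """Order transaction participants so that every writer precedes
    the participants it writes to; raise on a circular dependency."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    incoming = {n: 0 for n in nodes}       # count of unsent writers
    for ds in deps.values():
        for d in ds:
            incoming[d] += 1
    # Start from participants nobody writes to (sorted for determinism).
    ready = deque(sorted(n for n in nodes if incoming[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for d in sorted(deps.get(n, ())):
            incoming[d] -= 1
            if incoming[d] == 0:
                ready.append(d)
    if len(order) != len(nodes):
        raise ValueError("circular dependency among participants")
    return order

# DM1 writes to DM3, DM3 writes to DM2, so DM1 must be messaged first.
order = commit_order({"DM1": ["DM3"], "DM3": ["DM2"]})
```

With this in hand, the transaction can simply send prepare() and
vote() in dependency order, and a cycle becomes a hard error rather
than an intermittent one.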

The post-change flag approach I proposed has the advantage of determining
dependencies dynamically; that is, only dependencies that actually exist
will have an effect, and explicit management through priorities or
dependencies isn't required.  In terms of API, I'd much rather deal with
the overhead of before/after change notifications (as in my suggested
Observation API) than have to explicitly declare priorities or
dependencies.  I can much more easily verify (by testing or local code
inspection) that my object obeys the observation API, than I can debug
*global* and *dynamic* interaction dependencies.

So I'd *much* rather put up with the wrapper overhead on write
methods than deal with the global debugging nightmares that declaring
dependencies or priorities between data managers is (in my opinion)
likely to bring.  Such issues are harder for novice developers to
understand.  If their class works correctly, they reason, so too should
their application.  All the components worked individually, why won't
they work together?  IMO, the principle of least surprise says they
should just work, without needing to wave any additional dead chickens
over the code.