[Mailman-Developers] Requirements for a new archiver

Wed Oct 29 17:33:19 EST 2003

On Wed, 29 Oct 2003 20:11:49 +0100 
Brad Knowles <brad.knowles at skynet.be> wrote:
> At 1:30 PM -0500 2003/10/29, J C Lawrence wrote:

>> 2) Message IDs are not guaranteed globally unique, but the collision
>> rate can be manageable/acceptable in a large number of deployment
>> cases.

> Outside of a database, this may be something you can decide whether or
> not to live with.  Within the confines of a database, this simply is
> not possible.

Of course, and that's the point.  We are in violent agreement.

> The ANSI SQL specification has some hard requirements for a primary
> index key:

I know, but that's not what I'm asserting.  I'll also ignore the DB
types which don't require primary keys of any form, as that's
essentially what we have now and we're assuming an indexed store
instead.

>> We don't have to guarantee key uniqueness for all messages BEFORE
>> they are submitted to the message store.

> All other keys could potentially be non-unique, or null, but not
> the primary index key.  

Ahh, I think I see the disjoint.  We're using "key" in two contexts
without distinguishing between them:

  1) The property of a message which identifies that message with a high
  probability of uniqueness.  This can be a Message ID, MD5SUM,
  whatever, but it is not guaranteed unique, it merely is unique most of
  the time for large definitions of "most".

  2) The primary key as used in an indexed DB or other store which is
  guaranteed unique for all cases.

Between the two there's a conflict.  One requires perfect uniqueness.
The other delivers merely a good Best Effort.  The assertion is that we
don't always have to solve that mismatch.  We can elect to live with the
collisions.

> This is why many applications have the database assign the primary
> index key itself on insertion into the table, so that all the
> necessary requirements can be met.

Sure, except that doing that in our case requires that storage be a
synchronous operation (otherwise we don't know the key at
rewrite/delivery time).  That would a significant change from the
current model and rather unfriendly to a wide range of deployment cases.
Keeping the storage procedure asynchronous with an a-priori key (for
whatever guarantee of uniqueness) makes for a more interesting system.

>> I'm neither an idiot or a neophyte in this game.  Yes, a database
>> needs a primary unique key.

> Then you must realize that we could not possibly use message-id as the
> primary index key, unless this is a field that we generate ourselves
> in such a way that all the necessary requirements are met.

No, I don't realise that because it is false.  We can use Message IDs as
the primary key right now, today.  In fact, I am, right now, this
minute, today.  You are assuming that every message submitted to the
store must be accepted by the store.  That is an assumption that hasn't
been defined as a requirement and which some evidence suggests isn't a
hard requirement.  A very small percentage of the messages I submit to
my store don't make it.  They have duplicate Message IDs.  They run
through Mailman just fine.  They never reach my list archives.  I know,
expect, accept this.

The primary key has to be unique for every message IN THE STORE.
Accepted.  That does not dictate that the primary key for every message
SUBMITTED to the store has to be unique (not that key assignment is
occurring before collision check), or that the store has to ACCEPT every
message which is submitted to it. Guaranteeing perfect uniqueness of the
keys prior to submission to the store is fragile and expensive.  It is
tempting to do some form of very good approximation (cg Chuq's MD5SUM).
Without perfect synchrony with the store's keys. if we calculate keys
prior to insertion, or merely accept the keys that are given us in the
form of Message IDs we're going to get occasional collisions.

The question is how to handle messages whose a priori assigned keys
collide with keys already in the store.  We can handle the collision
case in several ways:

  1) Ignore it and discard messages bearing colliding keys.

  2) Best Effort attempt to guarantee uniqueness within a window, with
  collisions outside the window discarded.

  3) Fully guarantee uniqueness.

The first is easy.  The second is fairly easy.  The third isn't trivial.

In all three cases the population of key values in the store remains
unique.  Its just that the population of keys submitted to the store may
or may not be unique.  Lossage at the insertion layer can be acceptable.

>> Let's at least be on the same page.

> 	Agreed.

Cool.

-- 
J C Lawrence                
---------(*)                Satan, oscillate my metallic sonatas. 
claw at kanga.nu               He lived as a devil, eh?		  
http://www.kanga.nu/~claw/  Evil is a name of a foeman, as I live.