[Mailman-Developers] From the creation of a ThreadID

Thu Apr 5 20:42:51 CEST 2012

On Fri, 2012-04-06 at 00:10 +0900, Stephen J. Turnbull wrote:
> On Thu, Apr 5, 2012 at 10:41 PM, Pierre-Yves Chibon <pingou at pingoured.fr> wrote:
> 
> > In HyperKitty to be able to easily retrieve from the database all the
> > threads of a given month or just all the emails of a thread, I created a
> > Field in the database called ThreadID.
> > When I load the archives from mailman into mongo, I look for the absence
> > of the headers 'References' or 'In-Reply-To' to define an email that
> > starts a new thread.
> 
> This fails when a thread crosses channels.  Eg,
> 
> To: Pierre
> From: Steve
> Message-Id: <x at y.z>
> 
> is followed by
> 
> To: Steve
> From: Pierre
> Cc: SomeList
> References: <x at y.z>
> Message-Id: <a at b.c>
> 
> > Would anyone have an idea on how to generate a stable and delete/reload
> > proof ThreadID?
> 
> I don't see how this can be possible.  Eg, in the above scenario you
> construct a thread based on your reply to me.  Then I go, "oh, really
> I should have posted to mm-dev" and repost the thread.  So the
> "Message-ID of root message" fails, and I don't see an alternative
> that can be predicted.  So it may as well be arbitrary (eg, any
> message in the thread) and stored in the database with appropriate
> linkage from thread IDs to message IDs (one-to-many), and vice versa
> (many-to-one).

Ok, I missed something here.
So when it parses an email, it checks for 'References' or
'In-Reply-To'.
- If it finds them, it looks for the preceding email:
    - if it finds the preceding email, the current email gets the
      ThreadID of the preceding email
    - if it does not find the preceding email, the current email is
      assumed to start a new thread, so its ThreadID is its Message-ID
- If it does not find 'References' or 'In-Reply-To', the current
  email is assumed to start a new thread, so its ThreadID is its
  Message-ID
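
A minimal sketch of these rules in Python, to make the lookup order
concrete. It is not the actual HyperKitty code: the function names and
the `thread_ids` dict (standing in for the database) are hypothetical,
and checking 'In-Reply-To' before walking 'References' from the most
recent entry backwards is an assumption about what "the preceding
email" means.

```python
import re

# Matches one "<...>" Message-ID token inside a header value.
MSGID_RE = re.compile(r"<[^<>]+>")


def referenced_ids(headers):
    """Return candidate parent Message-IDs, nearest parent first."""
    ids = MSGID_RE.findall(headers.get("In-Reply-To", ""))
    # References lists ancestors oldest first, so walk it backwards.
    ids += reversed(MSGID_RE.findall(headers.get("References", "")))
    return ids


def assign_thread_id(headers, thread_ids):
    """Pick a ThreadID for one email and record it.

    thread_ids is a hypothetical stand-in for the database: a dict
    mapping the Message-ID of each stored email to its ThreadID.
    """
    msg_id = headers["Message-ID"].strip()
    # Default: the email starts a new thread named after itself.
    thread_id = msg_id
    for parent in referenced_ids(headers):
        if parent in thread_ids:
            # A preceding email is archived: inherit its ThreadID.
            thread_id = thread_ids[parent]
            break
    thread_ids[msg_id] = thread_id
    return thread_id
```

With the example above, the reply <a@b.c> references <x@y.z>, which
was never archived, so <a@b.c> becomes the ThreadID of a new thread and
later replies to it inherit that ThreadID.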

So for the example you give, the archiver will receive your email and
make a new thread out of it.

> > The other solution of course being that I regenerate the thread on the
> > fly based on the first email (which is still easy to find), but that
> > will be a lot of db querying.
> 
> I haven't thought about it deeply, but I would say just give the
> thread an arbitrary ID in the database.  Message-IDs are supposed to be
> universally unique, so what's wrong with keeping the thread in the
> database as a tree of message IDs?  Some Message-IDs will not have
> corresponding messages but that's always a problem with threading (see
> http://www.jwz.org/doc/threading.html, and RFC 5256).

The idea of using the Message-ID for the ThreadID (instead of an
integer) is that, whether I load one month or two months of archives
into the database, the link to the thread
(http://mm3test.fedoraproject.org/thread/packaging@fp.o/XU7HT5JC5GND2O4JII7MTQILLTB4IN4S)
will remain the same (so consistent URLs).
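
To illustrate why such a link survives a delete/reload: any token that
is computed only from the Message-ID comes out the same each time the
archive is loaded. The sketch below is an assumption, not a statement
of how the URL above is generated; it uses a base32-encoded SHA-1
digest, which happens to give 32 characters like the token in that
URL. The function name is hypothetical.

```python
import hashlib
from base64 import b32encode


def thread_url_token(message_id):
    """Derive a URL-safe token from a Message-ID (illustrative only)."""
    # Drop surrounding whitespace and angle brackets so "<x@y.z>" and
    # "x@y.z" give the same token.
    normalized = message_id.strip().strip("<>")
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()
    # 20 bytes encode to exactly 32 base32 characters, with no padding.
    return b32encode(digest).decode("ascii")
```

An auto-incremented integer, by contrast, depends on the order and
number of emails loaded, so it can change between reloads.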

> There are other problems with threading that need to be dealt with as
> well, such as References being inconsistent across messages in the
> same thread and people who continue a thread with a new message, etc.

For these I am not sure I can do anything (at least automatically; we
could always allow an admin to edit the field).

Pierre