[Mailman-Developers] From the creation of a ThreadID

Fri Apr 6 05:00:32 CEST 2012

On Fri, Apr 6, 2012 at 3:42 AM, Pierre-Yves Chibon <pingou at pingoured.fr> wrote:

> So when it parses the email, it checks for 'References' or
> 'In-Reply-To'.
> - If it finds them, it looks for the preceding email
>    - if it finds the preceding email, then the current email gets the
> ThreadID from the preceding email

So far, so good.

>    - if it does not find the preceding email, then the current email is
> assumed to be a new thread

This is unacceptable.  Mailing lists are not synchronous (e.g., because
of greylisting, though there are plenty of other reasons why mail
doesn't always go through immediately).  Threads must be able to
integrate new messages as they arrive, even if out of order.
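
To make the out-of-order requirement concrete, here is a minimal sketch
of one way to do it: keep one node per Message-ID and create an empty
placeholder for any parent that has not arrived yet.  The names
(Container, ThreadIndex, add, root_of) are hypothetical and only
In-Reply-To is considered; this is an illustration under those
assumptions, not the archiver's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Container:
    """One node per Message-ID (hypothetical name)."""
    message_id: str
    received: bool = False  # False while the message is only referenced
    parent: Optional["Container"] = field(default=None, repr=False)
    children: List["Container"] = field(default_factory=list, repr=False)


class ThreadIndex:
    def __init__(self) -> None:
        self.by_id: Dict[str, Container] = {}

    def _get(self, message_id: str) -> Container:
        # Create a placeholder if this Message-ID has not been seen yet.
        if message_id not in self.by_id:
            self.by_id[message_id] = Container(message_id)
        return self.by_id[message_id]

    @staticmethod
    def _is_ancestor(candidate: Container, node: Container) -> bool:
        # Walk up from `node`; the tree is kept acyclic, so this terminates.
        current: Optional[Container] = node
        while current is not None:
            if current is candidate:
                return True
            current = current.parent
        return False

    def add(self, message_id: str, in_reply_to: Optional[str] = None) -> None:
        node = self._get(message_id)
        node.received = True
        if in_reply_to and node.parent is None:
            parent = self._get(in_reply_to)
            # Refuse a link that would make a message its own ancestor.
            if not self._is_ancestor(node, parent):
                node.parent = parent
                parent.children.append(node)

    def root_of(self, message_id: str) -> str:
        node = self.by_id[message_id]
        while node.parent is not None:
            node = node.parent
        return node.message_id


index = ThreadIndex()
# The reply is delivered before the message it answers.
index.add("<reply@example.com>", in_reply_to="<root@example.com>")
root_before = index.root_of("<reply@example.com>")
index.add("<root@example.com>")
root_after = index.root_of("<reply@example.com>")
```

In this sketch the reply is attached to the same thread both before and
after its parent shows up, because the placeholder already stands in
for the missing message.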

> and thus its ThreadID is its Message-ID
> - if it does not find 'References' or 'In-Reply-To', then the current
> email is assumed to be a new thread and thus its ThreadID is its
> Message-ID

This isn't quite unacceptable, but it's clearly suboptimal.
(Well-known algorithms that handle this case nicely are available.)

> So for the example you give, the archiver will receive your email and
> make a new thread out of it.

That's an archiver that I won't use, and will strongly oppose as a
candidate for the bundled archiver for Mailman (any version).

>> I haven't thought about it deeply, but I would say just give the
>> thread an arbitrary ID in the database.  Message-IDs are supposed to
>> be universally unique, so what's wrong with keeping the thread in the
>> database as a tree of message IDs?  Some Message-IDs will not have
>> corresponding messages but that's always a problem with threading (see
>> http://www.jwz.org/doc/threading.html, and RFC 5256).
>
> The idea of using the Message-ID for ThreadID (instead of an integer) is
> that, whether I load one month or two months of archives into the
> database, the link to the thread
> (http://mm3test.fedoraproject.org/thread/packaging@fp.o/XU7HT5JC5GND2O4JII7MTQILLTB4IN4S) will remain the same (so consistent URLs).

Sure, but this is a matter of a persistent ID in the database.  When I
say "arbitrary" I don't mean you can't use a message ID to represent a
thread if you like, I mean that you can't algorithmically compute it
in a reliable, history-independent way.  From the point of view of a
user, you can't even be sure that a message without References or
In-Reply-To is a thread root (users will note the subject and the
content, and they will be displeased with any threading algorithm that
doesn't at least group subjects).

I don't say you need to implement that part of the JWZ/5256 algorithm
immediately, but you must not use a database schema that makes it hard
to add that feature later.
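
For the subject grouping mentioned above, a first step could look like
the following sketch.  It is a deliberately simplified stand-in for the
base-subject extraction in RFC 5256, not a faithful implementation: it
only strips leading "Re:"/"Fw:"/"Fwd:" markers and bracketed list tags.
The names base_subject and group_by_subject are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# One leading reply/forward marker or one bracketed tag such as "[list]".
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:|\[[^\]]*\])\s*", re.IGNORECASE)


def base_subject(subject: str) -> str:
    """Strip leading markers repeatedly, then normalise case and spacing."""
    previous = None
    while previous != subject:
        previous = subject
        subject = _PREFIX.sub("", subject, count=1)
    return " ".join(subject.split()).lower()


def group_by_subject(subjects: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each base subject to the Message-IDs that carry it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for message_id, subject in subjects.items():
        groups[base_subject(subject)].append(message_id)
    return dict(groups)


groups = group_by_subject({
    "<1@example.com>": "[Mailman-Developers] From the creation of a ThreadID",
    "<2@example.com>": "Re: [Mailman-Developers] RE: From the creation of a ThreadID",
    "<3@example.com>": "Something else entirely",
})
```

Here the first two messages land in the same group even if the second
one carries no References or In-Reply-To header at all.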

In most cases, users will have access to a Message-ID for some message
in the thread.  So I would want a URL like

    http://lists.example.com/archive/some-list/thread/MessageID/root/

to find the thread root for any message in the thread, not just a
particular representative of the thread.  (YMMV for the URL
scheme, of course.)  The last component of the URL path just gives the
focus (message to actually display and/or highlight in a tree widget);
other useful focuses might be "latest" (a message in the thread with
the most recent Date or Received header) and "self" (the message
itself is the focus).  More speculative focuses would be "parent"
(obvious, I hope) and "node" (the most recent ancestor message with
multiple children).
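
As a sketch of how those focuses could be resolved, assume the thread
is already stored as an acyclic tree keyed by Message-ID.  The Msg
class and resolve_focus function are hypothetical names, and "latest"
looks only at a parsed Date value here (not at Received headers).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Msg:
    message_id: str
    date: datetime
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


def resolve_focus(msgs: Dict[str, Msg], message_id: str, focus: str) -> Optional[str]:
    """Return the Message-ID to display for `focus`, or None if there is none."""

    def root(mid: str) -> str:
        while msgs[mid].parent is not None:
            mid = msgs[mid].parent
        return mid

    if focus == "self":
        return message_id
    if focus == "parent":
        return msgs[message_id].parent
    if focus == "root":
        return root(message_id)
    if focus == "latest":
        # Visit every message in the thread and keep the newest Date.
        stack, best = [root(message_id)], None
        while stack:
            mid = stack.pop()
            if best is None or msgs[mid].date > msgs[best].date:
                best = mid
            stack.extend(msgs[mid].children)
        return best
    if focus == "node":
        # Nearest ancestor with more than one child (a branch point).
        mid = msgs[message_id].parent
        while mid is not None and len(msgs[mid].children) < 2:
            mid = msgs[mid].parent
        return mid
    raise ValueError(f"unknown focus: {focus!r}")


t0 = datetime(2012, 4, 6, 3, 0)
msgs = {
    "root": Msg("root", t0, None, ["a", "b"]),
    "a": Msg("a", t0 + timedelta(minutes=1), "root", ["c"]),
    "b": Msg("b", t0 + timedelta(minutes=2), "root"),
    "c": Msg("c", t0 + timedelta(minutes=3), "a"),
}
focus_for_c = {f: resolve_focus(msgs, "c", f)
               for f in ("self", "parent", "root", "latest", "node")}
```

With this layout the URL handler only has to look up the Message-ID
from the path and call the resolver with the last path component.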

>> There are other problems with threading that need to be dealt with as
>> well, such as References being inconsistent across messages in the
>> same thread and people who continue a thread with a new message, etc.
>
> For these I am not sure I can do something (at least automatically, we
> could always allow an admin to edit the field).

You must do something about inconsistent References.  Suppose there is
a References loop?  It needs to be broken, somehow, or your program
will infloop.
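
One way to break such loops, sketched here under the assumption that
threads are kept as a child-to-parent map: check each candidate link
before storing it and skip any link that would close a cycle.  This is
loosely modelled on the loop check in the JWZ description, not a copy
of it; creates_loop and link_references are hypothetical names.

```python
from typing import Dict, List, Optional


def creates_loop(parent_of: Dict[str, Optional[str]], child: str, parent: str) -> bool:
    """True if making `parent` the parent of `child` would close a cycle."""
    node: Optional[str] = parent
    # The map is kept acyclic, so walking up from `parent` terminates.
    while node is not None:
        if node == child:
            return True
        node = parent_of.get(node)
    return False


def link_references(parent_of: Dict[str, Optional[str]],
                    message_id: str, references: List[str]) -> None:
    """Link consecutive entries of References + [message_id].

    A link is skipped if the child already has a parent or if the link
    would introduce a loop.
    """
    chain = references + [message_id]
    for mid in chain:
        parent_of.setdefault(mid, None)
    for parent, child in zip(chain, chain[1:]):
        if parent_of[child] is None and not creates_loop(parent_of, child, parent):
            parent_of[child] = parent


parent_of: Dict[str, Optional[str]] = {}
link_references(parent_of, "<c@example.com>", ["<a@example.com>", "<b@example.com>"])
# A broken client claims that <a> is a reply to <c>; the link is refused.
link_references(parent_of, "<a@example.com>", ["<c@example.com>"])
```

After both calls <a> is still the root and the a, b, c chain is intact,
so any walk up the parent links is guaranteed to stop.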

Anyway, this is all already taken care of in Jamie's algorithm.