[Email-SIG] API for email threading library?

Barry Warsaw barry at python.org
Fri Jan 6 02:21:08 CET 2012


On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:

>Folks, I'm working on an implementation of RFC 5256 email threading,
>designed so that it could fit as a submodule in the "email" package, if
>such a thing was ever seen to be useful.

I really like the idea of threading support being included in the email
package.  (I admit that I don't have time right now to read the RFC.)  My
general thoughts are that the actual messages needn't be included in the
thread collection, but perhaps just Message-IDs.  That would allow an
application to store the actual message objects anywhere they want, and would
reduce space requirements of the thread collection.

>I'd like to ask "the wisdom of the crowd" what they think an appropriate
>interface to such a thing would be?  The basic operation is that you
>create a collection (type C) of email threads (type T) by passing a set
>of messages (type M) to the constructor.
>
>* Should M be required to be "email.message.Message", or perhaps some
>  less restrictive type, say "ThreadableMessageAPI"?  All that's
>  strictly required is the ability to retrieve the Message-ID, Subject,
>  Date, References, and In-Reply-To fields.

I think it would be fine then to allow duck-typing of the input objects.  I
don't have a sense of whether it needs a formal (as in Python's ABCs)
interface type.
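
As a rough sketch of what duck-typing could mean here, assume the input only
needs mapping-style header lookup, which email.message.Message already
provides.  The names ThreadableMessage, THREADING_FIELDS and threading_headers
are hypothetical and not part of the email package; the Protocol is optional
and only documents the expectation.

```python
from email.message import Message
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class ThreadableMessage(Protocol):
    """Hypothetical structural type: anything with msg[name] header lookup."""

    def __getitem__(self, name: str) -> Optional[str]: ...


# The five header fields named in the original question.
THREADING_FIELDS = ("Message-ID", "Subject", "Date", "References", "In-Reply-To")


def threading_headers(msg: ThreadableMessage) -> dict:
    """Collect the threading-related headers; missing ones become None."""
    headers = {}
    for name in THREADING_FIELDS:
        try:
            headers[name] = msg[name]
        except KeyError:
            # Plain dicts raise KeyError; email.message.Message returns None.
            headers[name] = None
    return headers


# Both a real Message and a plain dict satisfy the duck type.
real = Message()
real["Message-ID"] = "<a@example.com>"
real["Subject"] = "hello"
fake = {"Message-ID": "<b@example.com>", "In-Reply-To": "<a@example.com>"}
```

Calling threading_headers(real) and threading_headers(fake) both work, which
is the point: nothing here depends on the concrete message class.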

>* What operations should be possible on C?  Some that come to mind:
>
>  * retrieve_thread (M or message-id) => T

Message-ID as input.

>  * add_message (M) => T

Duck-typed message.

>  * add_messages (set of M) => None
>  * remove_message (M or message-id) => T (or None) ?

Probably Message-ID as the input.  I guess the rule would be that if you need
all the headers you mention above, a duck-typed message would be required.
For operations that only need the Message-ID, just accept that.

And you probably want the full Message-ID header value, e.g. it would include
the angle brackets.
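
To make that rule concrete, here is a minimal sketch of a collection that
stores only Message-IDs.  ThreadCollection and its methods are hypothetical,
the thread is returned as a plain set of ids rather than a tree, and parent
links come from In-Reply-To alone, which is a simplification (References is
ignored here).

```python
class ThreadCollection:
    """Hypothetical collection that stores only Message-IDs, not messages."""

    def __init__(self):
        # Full Message-ID header value (angle brackets included) -> parent id.
        self._parent = {}

    def add_message(self, msg):
        """Needs more than the id, so it takes a duck-typed message."""
        msg_id = msg.get("Message-ID")
        # Simplification: only In-Reply-To is consulted for the parent link.
        self._parent[msg_id] = msg.get("In-Reply-To")
        return self.retrieve_thread(msg_id)

    def retrieve_thread(self, message_id):
        """Needs only the id, so it takes the Message-ID header value."""
        root = self._root(message_id)
        return {m for m in self._parent if self._root(m) == root}

    def _root(self, message_id):
        """Follow parent links upward; unknown parents act as dummy roots."""
        seen = set()
        # The 'seen' set guards against reply cycles so the loop terminates.
        while message_id not in seen:
            seen.add(message_id)
            parent = self._parent.get(message_id)
            if parent is None:
                break
            message_id = parent
        return message_id


collection = ThreadCollection()
collection.add_message({"Message-ID": "<1@x>"})
collection.add_message({"Message-ID": "<2@x>", "In-Reply-To": "<1@x>"})
collection.add_message({"Message-ID": "<3@y>"})
```

The application keeps the actual message objects wherever it likes and uses
the returned ids as lookup keys.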

>* What's the interface for T?  It's a tree with possible dummy nodes, so
>  a tuple of messages plus nested tuples would do it.  What should the
>  nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
>  email.message.Message instances?

Will the tree get mutated when a message is added in the middle of a thread,
or will you generate a new tree?  That would make a difference for
tuple-of-tuples or list-of-lists.

I think the nodes would be Message-IDs, but you'd need a public API for
normalizing them, and my application would have to make sure that my messages
are normalized (or at least the lookup keys are) or I might not be able to
find a message given its normalized id.  OTOH, maybe the message parser or
message object itself should provide an API for normalizing ids?
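
A minimal sketch of what such a public normalizing helper might look like,
assuming normalization means trimming whitespace and dropping unnecessary
quoting of the left-hand side.  The function name normalize_message_id is
hypothetical, and the precise rules would have to come from RFC 5256; this
only handles the simplest case.

```python
import re

# Matches <"local"@domain> where the quoted local part contains only
# characters that never needed quoting in the first place.
_NEEDLESSLY_QUOTED = re.compile(r'^<"([\w.+\-]+)"@(.+)>$')


def normalize_message_id(value: str) -> str:
    """Hypothetical helper: return a canonical lookup key for a Message-ID."""
    value = value.strip()
    match = _NEEDLESSLY_QUOTED.match(value)
    if match:
        # Drop the quotes, keep the angle brackets.
        return "<{}@{}>".format(match.group(1), match.group(2))
    # Anything else (including quoting that is actually required) is kept.
    return value
```

An application would apply the same function to its own lookup keys so that
both sides agree on the form of the id.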

Let's think about some use cases.

- given any message, find the entire thread it's a part of
- given a message, find all children
- given a message, find a path to the root of the thread
- find the parts of the thread that fall within a date range
- find the parts of a thread with a matching subject
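
The first three use cases can be sketched over nothing more than a mapping
from Message-ID to parent Message-ID.  Everything below is illustrative: the
sample ids and the function names are made up, and the links are assumed to be
free of cycles.  The date and subject queries would additionally need a side
index of those headers per id, which is not shown.

```python
from collections import defaultdict

# Hypothetical sample data: Message-ID -> parent Message-ID (None for roots).
parents = {
    "<1@x>": None,
    "<2@x>": "<1@x>",
    "<3@x>": "<1@x>",
    "<4@x>": "<2@x>",
}


def build_children(parents):
    """Invert the parent links: Message-ID -> list of direct replies."""
    children = defaultdict(list)
    for msg_id, parent in parents.items():
        if parent is not None:
            children[parent].append(msg_id)
    return children


def path_to_root(parents, msg_id):
    """Return [msg_id, its parent, ..., root]; assumes no cycles."""
    path = [msg_id]
    while parents.get(path[-1]) is not None:
        path.append(parents[path[-1]])
    return path


def whole_thread(parents, msg_id):
    """Return every id in the thread containing msg_id (order unspecified)."""
    children = build_children(parents)
    root = path_to_root(parents, msg_id)[-1]
    found, stack = [], [root]
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children.get(current, []))
    return found
```
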

>* For large sets of threads (millions of messages) a persistence
>  mechanism would be useful.  Should there be a standard interface to
>  such a mechanism, perhaps as class methods on C?  If so, what should
>  it look like?  Should the implementation contain a default persistent
>  subclass of C, based on sqlite3?  What side-effects would persistence
>  requirements have on the other design considerations?  For instance,
>  would you have to save the entire text of a message for each node?
>  Just the headers?  Just some of the headers?  Just the Message-ID?

Great questions.  We've long talked about a persistence mechanism for message
parts (e.g. store the big binary parts on disk instead of in memory).  Some
consistency of design would be good here.  But I agree that persistence should
definitely be part of the story, and it needs to be pluggable.
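
One way "pluggable" could look, purely as a sketch: the thread collection
talks to a small storage interface, and an in-memory backend and an sqlite3
backend both implement it.  ThreadStore, MemoryStore, SqliteStore and the
table layout are all hypothetical, and only the Message-ID links are stored,
not headers or message text.

```python
import sqlite3
from abc import ABC, abstractmethod


class ThreadStore(ABC):
    """Hypothetical storage interface holding Message-ID parent links."""

    @abstractmethod
    def set_parent(self, message_id, parent_id): ...

    @abstractmethod
    def get_parent(self, message_id): ...


class MemoryStore(ThreadStore):
    """Keeps the links in a plain dict."""

    def __init__(self):
        self._links = {}

    def set_parent(self, message_id, parent_id):
        self._links[message_id] = parent_id

    def get_parent(self, message_id):
        return self._links.get(message_id)


class SqliteStore(ThreadStore):
    """Keeps the links in an sqlite3 table (in memory by default)."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS links "
            "(message_id TEXT PRIMARY KEY, parent_id TEXT)"
        )

    def set_parent(self, message_id, parent_id):
        # Using the connection as a context manager commits on success.
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO links VALUES (?, ?)",
                (message_id, parent_id),
            )

    def get_parent(self, message_id):
        row = self._db.execute(
            "SELECT parent_id FROM links WHERE message_id = ?", (message_id,)
        ).fetchone()
        return row[0] if row else None
```

Code written against ThreadStore does not need to know which backend it was
handed, which is also the property one would want for message parts.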

Have to think more about this, but a big +1 for the idea.  It would serve as a
very good component for the ideas I have about a next generation email
archiver.

-Barry

