[Email-SIG] API for email threading library?

Bill Janssen janssen at parc.com
Wed Jan 11 20:00:39 CET 2012


Here's what I've got so far.  Comments would be appreciated.

Bill

======================================================================

This module implements email threading per RFC 5256.

It provides four classes: ThreadableObjectStore, MailboxStore,
ReferencesSet, and OrderedSubjectSet.

To use it, you need to provide it with a "mailstore", and a set of
messages to thread.  The mailstore must be a subclass of the
abstract class ThreadableObjectStore; an implementation of a
ThreadableObjectStore for mailbox.Mailbox is provided, as the class
MailboxStore.  Four methods must be implemented for a new
ThreadableObjectStore subclass:

  tos_get_message_id(msg or message ID) => message ID

    where the message ID is an immutable value that must be unique in
    that ThreadableObjectStore context, and the msg can be whatever
    that ThreadableObjectStore considers a message.

  tos_get_subject(msg or message ID) => subject

    where the subject is the subject of the message, or None

  tos_get_date(msg or message ID) => timestamp

    where the timestamp is the date and time of the message, expressed
    as seconds since the epoch, in the form returned by time.time()

  tos_get_references(msg or message ID) => sequence of message IDs

    where the references are a sequence of message IDs, arranged in
    order as per RFC 5322.  These message IDs must be in the same
    format as the message ID returned by tos_get_message_id().
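
To make the four methods above concrete, here is a minimal sketch of a
store backed by a plain dictionary.  The base class shown is only a
hypothetical stand-in for ThreadableObjectStore (the real one comes from
the module), and DictStore and its record layout are assumptions made up
for this example.

```python
import time
from abc import ABC, abstractmethod


class ThreadableObjectStore(ABC):
    """Hypothetical stand-in for the module's abstract base class."""

    @abstractmethod
    def tos_get_message_id(self, msg): ...

    @abstractmethod
    def tos_get_subject(self, msg): ...

    @abstractmethod
    def tos_get_date(self, msg): ...

    @abstractmethod
    def tos_get_references(self, msg): ...


class DictStore(ThreadableObjectStore):
    """Assumed example store: messages are dicts, indexed by message ID."""

    def __init__(self, messages):
        self._by_id = {m["id"]: m for m in messages}

    def _lookup(self, msg):
        # Accept either a message (a dict here) or a message ID (a string).
        return msg if isinstance(msg, dict) else self._by_id[msg]

    def tos_get_message_id(self, msg):
        return self._lookup(msg)["id"]

    def tos_get_subject(self, msg):
        return self._lookup(msg).get("subject")  # None if there is no subject

    def tos_get_date(self, msg):
        return self._lookup(msg)["date"]  # a time.time()-style float

    def tos_get_references(self, msg):
        return tuple(self._lookup(msg).get("references", ()))


now = time.time()
store = DictStore([
    {"id": "a@example.org", "subject": "Hello", "date": now - 60},
    {"id": "b@example.org", "subject": "Re: Hello", "date": now,
     "references": ["a@example.org"]},
])
```

Every accessor takes either a message or a message ID, which is why the
sketch routes all four through a single lookup helper.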

The base ThreadableObjectStore class also provides a class method to
compute the RFC 5256 "base subject":

  ThreadableObjectStore.tos_base_subject(subject text) => \
        subject, is_reply_or_forward

    Takes a standard Subject: header value and returns the "base
    subject" for it, along with a boolean flag indicating whether the
    supplied subject marked the message as a reply to, or a forward
    of, the original subject.
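
As a rough illustration of what a base-subject computation involves,
here is a simplified approximation of the RFC 5256 rules.  It is not the
module's implementation: the function name base_subject is hypothetical,
and the sketch skips MIME encoded-word decoding entirely.

```python
import re

# Leading "Re:", "Fw:" or "Fwd:", optionally preceded by "[blob]" tags and
# optionally followed by one before the colon.
_LEADER = re.compile(
    r"^(?:\[[^\[\]]*\]\s*)*(?:re|fwd?)\s*(?:\[[^\[\]]*\])?\s*:\s*", re.I)
_BLOB = re.compile(r"^\[[^\[\]]*\]\s*")              # leading "[list-name]"
_TRAILER = re.compile(r"\s*\(fwd\)\s*$", re.I)       # trailing "(fwd)"
_WRAPPER = re.compile(r"^\[fwd:(.*)\]$", re.I | re.S)  # "[fwd: ...]"


def base_subject(subject):
    """Return (base subject, is_reply_or_forward) for a Subject: value."""
    text = " ".join(subject.split())  # collapse runs of whitespace
    flagged = False
    while True:
        before = text
        for pattern in (_TRAILER, _LEADER):
            text, count = pattern.subn("", text)
            flagged = flagged or count > 0
        # Drop a leading "[blob]" only if something remains afterwards.
        without_blob = _BLOB.sub("", text)
        if without_blob:
            text = without_blob
        wrapped = _WRAPPER.match(text)
        if wrapped:
            text = wrapped.group(1).strip()
            flagged = True
        if text == before:  # nothing more to strip
            return text, flagged


result = base_subject("Re: [Email-SIG] API for email threading library?")
```

The loop repeats until a pass changes nothing, so stacked prefixes such
as "Fwd: Re: ..." are peeled off one layer at a time.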

To develop a set of threads, you then instantiate either ReferencesSet
(the JWZ algorithm from Netscape, formalized in RFC 5256), or
OrderedSubjectSet (the "same subjects" algorithm, aka "poor man's
threading"), both subclasses of the abstract class ThreadSet.  Each
constructor takes a ThreadableObjectStore instance and optionally a
set of messages to use for the initial threads.  If provided, those
messages are analyzed into a set of threads.  The threadset is
iterable; the iteration is over the threads it contains.
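
For readers unfamiliar with the "same subjects" approach, the idea can
be sketched in a few lines.  This is my reading of the ORDEREDSUBJECT
algorithm in RFC 5256, not the code in OrderedSubjectSet; the function
names, the tuple layout, and the crude prefix stripping are all
assumptions for the example.

```python
import re
from itertools import groupby

_PREFIX = re.compile(r"^(?:(?:re|fwd?)\s*:\s*)+", re.I)


def simple_base_subject(subject):
    # Crude stand-in for the RFC 5256 base subject: drop leading Re:/Fwd:
    # prefixes and lowercase the rest for case-insensitive comparison.
    return _PREFIX.sub("", subject.strip()).lower()


def ordered_subject_threads(messages):
    """messages: iterable of (message_id, subject, timestamp) tuples.

    Returns a list of threads, each a list of message IDs in date order;
    the first ID in each list is the thread root.
    """
    def subject_key(m):
        return simple_base_subject(m[1])

    # Sort by base subject, then by date within each subject.
    keyed = sorted(messages, key=lambda m: (subject_key(m), m[2]))
    threads = [list(group) for _, group in groupby(keyed, key=subject_key)]
    # Order the threads by the date of their first (oldest) message.
    threads.sort(key=lambda thread: thread[0][2])
    return [[m[0] for m in thread] for thread in threads]


msgs = [
    ("3", "Re: lunch?", 300.0),
    ("1", "lunch?", 100.0),
    ("2", "Budget", 200.0),
    ("4", "RE: budget", 400.0),
]
threads = ordered_subject_threads(msgs)
```

No References: headers are consulted at all, which is what makes this
the cheap alternative to the REFERENCES algorithm.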

An instance of ThreadSet provides the following methods:

  add (msg or message ID) => thread

    add another message from the mailstore to the thread set.  The
    returned "thread" is the root of the thread tree containing that
    msg; it is an object which has the attributes "message_id" (a
    string) and "children" (an ordered list of sub-threads).

  remove (msg or message ID) => thread

    remove a message from the thread set, where thread is as for
    "add()", but may additionally be 'None' if the message was not in
    a thread, or was the only message in the thread.

  thread (msg or message ID) => thread

    obtain the thread containing the specified message, if any,
    where "thread" is as for "add()", or 'None' if no thread for
    that message exists.

  subject_threads (subject regexp) => set of threads

    obtain the threads whose base subject contains a match for the
    specified regular expression, where "regexp" is a textual or
    compiled regular expression, and the return value is a set of
    threads.  Note that subject comparisons are case-insensitive;
    compiled regexps must use the re.IGNORECASE flag.

  date_threads (starting time, ending time, root_only=False) => set of threads

    obtain the set of threads containing any messages between
    the two timestamps.  Timestamps are time.time() timestamps;
    the starting time may be given as 'None' to mean the start of
    time, and the ending time as 'None' to mean the distant future.
    If "root_only" is true, only the date of each thread's root
    message is considered; threads with no root message (a subject
    forest) will always fail to match in this case.

  __contains__ (msg or message ID) => boolean

    Present to support the "in" operator.
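
The thread objects returned by these methods are small trees, so they
are easy to walk.  The sketch below uses a hypothetical Thread class with
the two attributes described above, and shows one way the date_threads()
matching rule could work.  Using None as the message_id of a rootless
thread and treating the range as inclusive are both assumptions on my
part.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Thread:
    """Hypothetical thread node: a message ID plus ordered sub-threads."""
    message_id: Optional[str]  # assumed None when there is no root message
    children: List["Thread"] = field(default_factory=list)


def walk(thread, depth=0):
    """Yield (depth, message_id) for every node, depth-first."""
    yield depth, thread.message_id
    for child in thread.children:
        yield from walk(child, depth + 1)


def in_range(thread, dates, start=None, end=None, root_only=False):
    """True if the thread has a message dated within [start, end].

    `dates` maps message ID -> time.time()-style timestamp; passing None
    for `start` or `end` leaves that side of the range open.
    """
    lo = float("-inf") if start is None else start
    hi = float("inf") if end is None else end
    if root_only:
        ids = [thread.message_id]
    else:
        ids = [mid for _, mid in walk(thread)]
    return any(mid is not None and lo <= dates[mid] <= hi for mid in ids)


root = Thread("a", [Thread("b", [Thread("c")]), Thread("d")])
dates = {"a": 10.0, "b": 20.0, "c": 30.0, "d": 40.0}
order = [mid for _, mid in walk(root)]
```

With root_only set, a thread whose root has no message can never match,
which is the behaviour described for subject forests.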

Support for persistence is provided by an instance method,
"to_external_form", and a class method, "from_external_form", on
thread sets.  Calling "to_external_form" on a thread set instance
generates a set of tree-structured nested tuples, where each tuple
consists of an optional message ID followed by zero or more child
tuples.  The class method "from_external_form", provided by both
ReferencesSet and OrderedSubjectSet, takes a ThreadableObjectStore
instance and an externalized thread set value, and creates and
returns a new thread set instance initialized to that set of threads.
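
To show the shape of the externalized form, here is a small round-trip
sketch.  The helper names to_nested and from_nested are hypothetical,
nodes are modelled as plain (message_id, children) pairs, and I am
assuming that a missing message ID is written as None in the first slot
of the tuple.

```python
import ast


def to_nested(node):
    """Convert a (message_id, [children]) node into nested tuples."""
    message_id, children = node
    return (message_id,) + tuple(to_nested(child) for child in children)


def from_nested(form):
    """Inverse of to_nested(): rebuild (message_id, [children]) nodes."""
    return (form[0], [from_nested(child) for child in form[1:]])


tree = ("a@example.org", [
    ("b@example.org", [("c@example.org", [])]),
    ("d@example.org", []),
])
external = to_nested(tree)

# Nested tuples of strings survive a trip through their repr(), so the
# externalized form can be stored as text and read back later.
restored = ast.literal_eval(repr(external))
```

Because the form contains only tuples and message IDs, it can be saved
with whatever serialization is convenient.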

MailboxStore is a subclass of ThreadableObjectStore designed to
wrap mailboxes (subclasses of mailbox.Mailbox).  For instance,

  >>> import mailbox
  >>> mbox = mailbox.mbox("foo.mbox")
  >>> mboxstore = MailboxStore(mbox)
  >>> threadset = ReferencesSet(mboxstore, mbox.itervalues())

will produce a thread set for all the messages in the mbox-format
mailbox 'foo.mbox', using the REFERENCES threading algorithm.
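
For a sense of what a store wrapping mailbox messages has to extract,
the sketch below pulls the four threading fields out of a standard
email.message.Message.  It is an illustration only, not MailboxStore's
code; the function message_fields is hypothetical, and falling back to
In-Reply-To when References is absent is an assumption.

```python
import email
import re
from email.utils import mktime_tz, parsedate_tz

_MSGID = re.compile(r"<([^<>\s]+)>")  # message IDs, without the brackets


def message_fields(msg):
    """Return (message_id, subject, timestamp, references) for a Message."""
    ids = _MSGID.findall(msg.get("Message-ID", ""))
    message_id = ids[0] if ids else None
    date_header = msg.get("Date")
    parsed = parsedate_tz(date_header) if date_header else None
    timestamp = mktime_tz(parsed) if parsed else None
    # Prefer References; fall back to In-Reply-To when it is absent.
    refs = (_MSGID.findall(msg.get("References", ""))
            or _MSGID.findall(msg.get("In-Reply-To", "")))
    return message_id, msg.get("Subject"), timestamp, refs


raw = (
    "Message-ID: <c@example.org>\n"
    "Date: Wed, 11 Jan 2012 20:00:39 +0100\n"
    "Subject: Re: API for email threading library?\n"
    "References: <a@example.org> <b@example.org>\n"
    "\n"
    "body\n"
)
fields = message_fields(email.message_from_string(raw))
```

The timestamp comes back in seconds since the epoch, matching what
tos_get_date() is expected to return.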

MailboxStore also provides a static method to compute the normalized
form of a message ID (the message ID stripped of <> angle brackets,
and various quoted parts unquoted):

  MailboxStore.normalize_message_id(message ID) => message ID

    Take a standard RFC 5322 message ID string and return the
    normalized form of it.
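
The normalization described above can be approximated as follows.  This
is a sketch under simplifying assumptions (only a quoted local part is
unquoted, and comments or folding whitespace inside the ID are not
handled); it is not the MailboxStore implementation.

```python
import re

_QUOTED_PAIR = re.compile(r"\\(.)")  # backslash followed by any character


def _unquote(part):
    # Turn "quoted string" into its bare content, resolving backslash
    # escapes; leave anything that is not fully quoted untouched.
    if len(part) >= 2 and part[0] == '"' and part[-1] == '"':
        return _QUOTED_PAIR.sub(r"\1", part[1:-1])
    return part


def normalize_message_id(message_id):
    """Strip <> brackets and unquote the local part of a message ID."""
    text = message_id.strip()
    if text.startswith("<") and text.endswith(">"):
        text = text[1:-1]
    local, sep, domain = text.rpartition("@")
    if not sep:  # no "@" at all: treat the whole value as one part
        return _unquote(text)
    return _unquote(local) + "@" + domain


normalized = normalize_message_id(' <"ab c"@example.org> ')
```

Normalizing first means that quoted and unquoted spellings of the same
ID compare equal when references are matched up.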

