[Archiver-dev] UpLib and archiving

Sun Oct 17 23:01:12 CEST 2010

Just noticed this list, and thought I'd sign up.

I build the UpLib archive system, at http://uplib.parc.com/.

The latest release includes new support for building very large
archives.  UpLib has some support for email archiving already, including
thread analysis and a built-in IMAP server, but that support needs to be
re-worked for efficiency to support large archives.  So that's what I'm
thinking about at the moment.

Some topics:

1.  An email thread analysis library written against a mixin class, say
    ThreadableEmail, so that different email packages could use it.

2.  Support for multipart/related parsing.

3.  Indexing for search.  UpLib currently indexes email into PyLucene
    with the following fields:

      date (untokenized)
      contents (tokenized -- just the body text, not the headers)
      email-message-id (untokenized)
      email-guid (untokenized -- a hash of the message-id)
      email-subject (tokenized)
      email-from-name (tokenized, only used if present)
      email-from-address (untokenized)
      email-attachment-to (untokenized, for attachments, guid of message)
      email-thread-index (untokenized, thread ID)
      email-references (untokenized, zero or more email-guids)
      email-in-reply-to (untokenized, zero or more email-guids)
      email-recipient-names (untokenized [should be tokenized])
      email-recipients (untokenized -- who the message was sent to)

    Attachments are extracted, and indexed separately, with links from the
    attachment to the message, and links from the message to its
    attachments.  This is an advantage UpLib has over systems built
    specifically for mail archiving -- it can also archive images, Word,
    PDF, etc., and do proper metadata indexing on each of those types.

    It also tries to leverage Lucene's multi-language support, by
    running a language guesser over the text of the email, and selecting
    the Lucene Analyzer which most closely matches that language.

    So, is this a good list of indexing fields?  Bad list?  Where does
    the Dublin Core factor into this?
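
To make the field list in topic 3 concrete, here is a rough sketch of how
most of those fields could be pulled out of a raw message using only the
Python standard library.  This is not UpLib's actual code.  The function
names (guid_for, body_text, index_fields) are made up for illustration,
and SHA-1 is an assumption: the list above only says "a hash of the
message-id".  email-attachment-to and email-thread-index are left out
because they need context from outside the single message.

```python
import hashlib
from email import message_from_string
from email.utils import getaddresses, parseaddr


def guid_for(message_id):
    # Assumption: SHA-1 hex digest; the hash actually used is not specified.
    return hashlib.sha1(message_id.strip().encode("utf-8")).hexdigest()


def _ids(header_value):
    # References / In-Reply-To hold whitespace-separated <...> tokens.
    tokens = (header_value or "").split()
    return [t for t in tokens if t.startswith("<") and t.endswith(">")]


def body_text(msg):
    # Collect text/plain parts that are not attachments; headers excluded.
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain" and not part.get_filename():
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


def index_fields(raw):
    msg = message_from_string(raw)
    message_id = msg.get("Message-ID", "")
    from_name, from_addr = parseaddr(msg.get("From", ""))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    fields = {
        "date": msg.get("Date", ""),
        "contents": body_text(msg),
        "email-message-id": message_id,
        "email-guid": guid_for(message_id),
        "email-subject": msg.get("Subject", ""),
        "email-from-address": from_addr,
        "email-references": [guid_for(i) for i in _ids(msg.get("References"))],
        "email-in-reply-to": [guid_for(i) for i in _ids(msg.get("In-Reply-To"))],
        "email-recipient-names": [name for name, _ in recipients if name],
        "email-recipients": [addr for _, addr in recipients if addr],
    }
    if from_name:
        # Only present when the From header carries a display name.
        fields["email-from-name"] = from_name
    return fields
```

The resulting dict would then be handed to whatever indexer is in use;
which fields get tokenized is a decision for that layer, not this one.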

4.  Archive server frameworks.  My IMAP server is currently built on top
    of Medusa, like the rest of UpLib.  No one's working on Medusa.
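
For topic 1, a minimal sketch of what a ThreadableEmail mixin might look
like.  The attribute names (message_id, references, in_reply_to) and the
thread_roots helper are assumptions for illustration; the grouping below
is a deliberately simple parent-chain walk, not a full threading
algorithm (it does no subject-based merging, for instance).

```python
class ThreadableEmail:
    """Mixin.  The host class is assumed to provide:
    message_id (str), references (list of str), in_reply_to (list of str)."""

    def parent_id(self):
        # Prefer the last References entry, then the first In-Reply-To.
        if self.references:
            return self.references[-1]
        if self.in_reply_to:
            return self.in_reply_to[0]
        return None


def thread_roots(messages):
    """Group message ids by the root of their reply chain."""
    parent = {m.message_id: m.parent_id() for m in messages}

    def root_of(mid):
        seen = set()  # guards against reference loops
        while parent.get(mid) and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    threads = {}
    for m in messages:
        threads.setdefault(root_of(m.message_id), []).append(m.message_id)
    return threads


# Hypothetical host class, standing in for some email package's message type.
class Msg(ThreadableEmail):
    def __init__(self, message_id, references=(), in_reply_to=()):
        self.message_id = message_id
        self.references = list(references)
        self.in_reply_to = list(in_reply_to)
```

Any package whose message class exposes those three attributes could mix
this in.  A message whose parent is missing from the archive is grouped
under the missing parent's id, so replies to the same absent message
still land in one thread.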

Bill