[Archiver-dev] Continuously crawling an email list

Hank Leininger hlein at marc.info
Thu Feb 10 16:23:40 CET 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 10 Feb 2011, Jeff Marshall wrote:

> Hi Matt.  For mail-archive.com we essentially do what you mention: we have a
> global email address (archive at mail-archive.com) that list admins can
> subscribe to their list, and then our service converts the received messages
> into MHonArc archives.  When a list admin comes to us with a request to load
> up old archives we chew through the old mbox files and pump them into our
> MHonArc-digesting code.
>
> I'm not aware of any scripts that would follow a list in an easier fashion.
> A benefit to subscribing an address to the list is you make the list admin
> aware of who you are, and they have the option of saying no.

For MARC it is similar, although we subscribe one distinct email alias
per list (typically listname at marc.info).  The back-end is different,
of course: we stuff messages into an SQL DB rather than MHonArc.  For
us, inserting a message also queues it for search-indexing.

The message-insertion script can handle a single message or many, so
it's the same codepath if I get my hands on (or reconstruct) mboxes of
historical list traffic.  I've also got various utility scripts to pull
down and clean up messages from different existing archive types.  If
you (or anybody else, for that matter) want them, let me know and I'll
make them available somewhere.  (Maybe even with a <6 month turnaround. :( )
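
As a rough illustration of that single-codepath idea (a sketch only, not
MARC's actual script): the insert function below takes any iterable of
messages, so one live message and a whole historical mbox go through the
same code.  The table names, the open_store() and insert_messages()
helpers, and the use of SQLite are all assumptions made for the example.

```python
import email
import mailbox
import sqlite3
from email.message import Message
from typing import Iterable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical message store with an index queue."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY,"
        " list_name TEXT NOT NULL,"
        " message_id TEXT,"
        " subject TEXT,"
        " raw BLOB NOT NULL,"
        " UNIQUE (list_name, message_id))"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS index_queue (message_rowid INTEGER NOT NULL)"
    )
    return db


def insert_messages(
    db: sqlite3.Connection, list_name: str, messages: Iterable[Message]
) -> int:
    """Insert one or many messages; queue each new row for search-indexing.

    Returns the number of messages actually inserted.  A message whose
    Message-ID is already stored for this list is skipped.  (Messages
    with no Message-ID are always inserted, since SQLite treats NULLs
    as distinct in a UNIQUE constraint.)
    """
    inserted = 0
    for msg in messages:
        subject = msg.get("Subject")
        cur = db.execute(
            "INSERT OR IGNORE INTO messages (list_name, message_id, subject, raw)"
            " VALUES (?, ?, ?, ?)",
            (
                list_name,
                msg.get("Message-ID"),
                str(subject) if subject is not None else None,
                msg.as_bytes(),
            ),
        )
        if cur.rowcount == 1:  # row was new, so it needs indexing
            db.execute(
                "INSERT INTO index_queue (message_rowid) VALUES (?)",
                (cur.lastrowid,),
            )
            inserted += 1
    db.commit()
    return inserted


# Live delivery: wrap the single message in a list.
#   insert_messages(db, "archiver-dev", [email.message_from_bytes(raw_bytes)])
# Historical import: a mailbox.mbox object is itself an iterable of messages.
#   insert_messages(db, "archiver-dev", mailbox.mbox("old.mbox", create=False))
```

Skipping duplicates by Message-ID means re-running a historical import
over a list that is already being followed live should not double-insert
the overlap.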

I suppose you could use IMAP to poll/pull list-messages from a
subscribed account, like if you didn't want to have an SMTP path into
your archive-cooking server, but in most cases that just seems to me
like unnecessary extra moving parts.
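
For anyone who does want the IMAP route, a minimal polling sketch using
Python's standard imaplib might look like the following.  The
fetch_unseen() helper, the folder name, and the host in the usage
comment are assumptions for illustration; error handling and
reconnection are left out.

```python
import email
import imaplib
from email.message import Message
from typing import Iterator


def fetch_unseen(conn: imaplib.IMAP4, folder: str = "INBOX") -> Iterator[Message]:
    """Yield each unseen message in `folder` as a parsed email message.

    Fetching the RFC822 body (without .PEEK) makes the server mark the
    message as seen, so the next poll does not return it again.
    """
    status, _ = conn.select(folder)
    if status != "OK":
        raise RuntimeError("could not select folder %r" % folder)

    status, data = conn.search(None, "UNSEEN")
    if status != "OK":
        raise RuntimeError("IMAP search failed")

    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue  # skip messages the server could not return
        # parts[0] is an (envelope, raw_message_bytes) tuple
        yield email.message_from_bytes(parts[0][1])


# Usage (hypothetical host and credentials), run periodically:
#   conn = imaplib.IMAP4_SSL("imap.example.org")
#   conn.login(user, password)
#   for msg in fetch_unseen(conn):
#       ...hand msg to the same insertion code used for everything else...
#   conn.logout()
```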

Thanks,

Hank

> On Tue, Feb 8, 2011 at 11:20 AM, Matt Chaput <matt at sidefx.com> wrote:
>
>> Hi, just saw this list, it seemed related to what I want to do, so I signed
>> up :)
>>
>> I want to create a web app that indexes email messages as they appear in a
>> MailMan list and makes them available for search.
>>
>> It seems like one way to do this would be to create an email account, sign
>> it up to the list, and use an IMAP4 client to poll the account and download
>> new messages.
>>
>> But is that the best way? For one thing, it doesn't allow the batch
>> indexing of old list messages. For that I'd have to download tar'd archives
>> and support separate indexing paths for old (archived), newish (downloaded
>> recently) and new (just pulled out of the account) messages.
>>
>> Is there a good way to have a script "follow" an email list? And better
>> yet, is there already code out there to do so ;)
>>
>> Thanks,
>>
>> Matt
>> _______________________________________________
>> Archiver-dev mailing list
>> Archiver-dev at python.org
>> http://mail.python.org/mailman/listinfo/archiver-dev
>>
>

-----BEGIN PGP SIGNATURE-----

iD8DBQFNVAL8qP26fHCT+PMRAl9pAJ9Z7yhE4LCkvt0vo/M1kObOuTZcMACfdzMS
z2BU1KwXwdI0FLlG1tSyvgI=
=1NhY
-----END PGP SIGNATURE-----

