[Mailman-Developers] New Pipermail hacks (was Re: Ok, it works! ...)

Thu, 25 Oct 2001 00:48:34 -0400

>>>>> "BAW" == Barry A Warsaw <barry@zope.com> writes:

    BAW> If you're watching the CVS log messages, you might see some
    BAW> checkins to address the problems with Pipermail in 2.1a3.
    BAW> Had an all day meeting today, and I'm beat so I'll email more
    BAW> about it tomorrow, but I think I have a neat solution that
    BAW> will also address Ben's patch to clean attachments out of the
    BAW> archives, and may serve as a basis for a built-in de-mimer.

So here's the scoop.  I've been thinking about Ben Gertzfield's code
to sanitize the archives, and I've been mulling over the de-mime
stuff.  It all came to a head when 2.1a3 broke archiving for multipart
messages.

Here's what I've now got in CVS, and it seems to work fairly well.
Only more testing will tell for sure.

There's a new handler module called Scrubber.py, but it's not in the
primary pipeline.  Only Pipermail is going to call it, and that via
the new mm_cfg.py/Default.py variable ARCHIVE_SCRUBBER.

This module hardcodes the following de-mime decisions:

- text/plain parts are passed through unchanged

- text/html parts are removed completely.  If the outer message is of
  type text/html then the whole message is discarded
  (i.e. DiscardMessage is raised).

- For all other non-multipart parts, we treat them as "attachments" by
  pulling the decoded payload out of the message, storing it in a file
  inside the list's private archive directory
  (e.g. archives/private/mylist/attachments) and rewriting the payload
  of the part to include a description of the attachment.

  Included in this description is a url to the attachment file, which
  Pipermail will hyperlink.  One drawback here is that if archives are
  switched from public to private, or vice versa, all the attachment
  urls will break.  But you could re-run bin/arch to regenerate the
  whole thing -- the key being that Scrubber works only on a copy of
  the message being prepped for the archiver, /not/ on the message
  being saved in the mbox.

- multipart containers themselves are left alone, but we recurse into
  them to apply the above cleaning to each of their subparts.
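
To make the decisions above concrete, here is a rough sketch of the
same rules.  It is not the code in Scrubber.py: it is written against
the current Python 3 standard library email package, and the names
scrub, attach_dir, base_url, the attachment-NNNN.bin file naming and
the wording of the description are all made up for illustration.
DiscardMessage here is a local stand-in for the exception mentioned
above.

```python
import itertools
import os


class DiscardMessage(Exception):
    """Local stand-in: raised when the whole message should be dropped."""


def scrub(msg, attach_dir, base_url):
    """Apply the hardcoded de-mime decisions to msg in place; return msg.

    attach_dir and base_url are hypothetical parameters: where attachment
    files are written, and the URL prefix under which they are reachable.
    """
    if msg.get_content_type() == 'text/html':
        # The outer message is HTML, so the whole message is discarded.
        raise DiscardMessage
    _scrub_part(msg, attach_dir, base_url, itertools.count(1))
    return msg


def _scrub_part(part, attach_dir, base_url, counter):
    if part.is_multipart():
        # Containers are not rewritten themselves; recurse into subparts.
        kept = []
        for sub in part.get_payload():
            if sub.get_content_type() == 'text/html':
                continue  # text/html parts are removed completely
            _scrub_part(sub, attach_dir, base_url, counter)
            kept.append(sub)
        part.set_payload(kept)
        return
    if part.get_content_type() == 'text/plain':
        return  # passed through unchanged
    # Everything else is an "attachment": save the decoded payload to a
    # file and replace the part's payload with a short description.
    data = part.get_payload(decode=True) or b''
    ctype = part.get_content_type()
    name = 'attachment-%04d.bin' % next(counter)
    os.makedirs(attach_dir, exist_ok=True)
    with open(os.path.join(attach_dir, name), 'wb') as fp:
        fp.write(data)
    for header in ('Content-Type', 'Content-Transfer-Encoding',
                   'Content-Disposition'):
        del part[header]
    part['Content-Type'] = 'text/plain'
    part.set_payload(
        'A non-text attachment was scrubbed...\n'
        'Type: %s\nSize: %d bytes\nURL: %s/%s\n'
        % (ctype, len(data), base_url, name))
```

Since the mbox should keep the original, a caller would hand scrub()
a copy of the message (copy.deepcopy(msg), say), never the message
that is about to be saved.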

Then the entire scrubbed message is converted into a flat message,
where only the headers are parsed and the body is slurped in one gulp;
it isn't parsed recursively.  Along the way, we throw out the headers
for any internal parts, and we play games with the inter-part boundary
strings so they are more useful (yes, this is a kludge).
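
As a rough illustration of the flattening idea (again not the actual
code, and using the current Python 3 email package), the sketch below
takes a simpler route than the one described: instead of re-parsing
the generated text headers-only and massaging the boundary strings,
it walks the leaf parts, drops their headers, and joins their text
with a readable separator under the outer headers.  The names
flatten and SEPARATOR are hypothetical, and charset handling is
mostly glossed over.

```python
from email.message import Message

# Hypothetical marker placed between what used to be separate MIME parts.
SEPARATOR = '-------------- next part --------------\n'


def _text_of(part):
    """Return the decoded text of a non-multipart part."""
    raw = part.get_payload(decode=True) or b''
    charset = part.get_content_charset() or 'us-ascii'
    try:
        return raw.decode(charset, 'replace')
    except LookupError:
        # Unknown charset name: fall back to ASCII with replacement.
        return raw.decode('us-ascii', 'replace')


def flatten(msg):
    """Build a new single-part message from the leaf parts of msg."""
    chunks = [_text_of(p) for p in msg.walk() if not p.is_multipart()]
    body = SEPARATOR.join(
        c if c.endswith('\n') else c + '\n' for c in chunks)
    flat = Message()
    # Only the outer headers survive; headers of internal parts are
    # never copied, and the MIME structure headers are replaced.
    for name, value in msg.items():
        if name.lower() not in ('content-type',
                                'content-transfer-encoding'):
            flat[name] = value
    flat['Content-Type'] = 'text/plain'
    flat.set_payload(body)
    return flat
```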

There's even more kludgery involved to get Pipermail to archive
scrubbed messages without having to rewrite huge chunks of inscrutable
code.  But it seems to work.

Now, the interesting thing is that Scrubber.py is written so that it
/could/ be used in the main pipeline.  E.g. it supports the proper
signature and semantics for use in the pipeline.  But I'm not adding
it there for now primarily because it isn't configurable via the web.
All its decisions above are hardcoded because getting the u/i right is
more work than I want to do right now.

But if you were interested in mainlining Scrubber.py, here's how you
might do it: Add it to GLOBAL_PIPELINE in your mm_cfg.py.  I would
suggest sticking it after ToArchive so that the mbox gets the
original unscrubbed message (this lets you adjust the scrubber's
behavior for archive purposes and regenerate from the raw mbox).
In fact, what I'd do is move ToArchive to just after the Hold module,
and stick Scrubber right behind it, i.e. between ToArchive and Tagger.
This is untested.
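
One way to express that reordering is sketched below.  It reads the
suggestion as Hold, then ToArchive, then Scrubber, then Tagger.  The
helper name mainline_scrubber is made up, and EXAMPLE_PIPELINE is an
abbreviated stand-in, not the real default from Defaults.py; like the
suggestion itself, this has not been tried against a running Mailman.

```python
def mainline_scrubber(pipeline):
    """Return a copy of pipeline with ToArchive moved to just after
    Hold and Scrubber inserted right behind it."""
    new = [h for h in pipeline if h not in ('ToArchive', 'Scrubber')]
    at = new.index('Hold') + 1
    new[at:at] = ['ToArchive', 'Scrubber']
    return new


# Abbreviated, illustrative handler list (an assumption, not the
# actual GLOBAL_PIPELINE shipped with any release).
EXAMPLE_PIPELINE = [
    'SpamDetect', 'Approve', 'Hold', 'Tagger', 'CalcRecips',
    'ToDigest', 'ToArchive', 'ToOutgoing',
]

# In mm_cfg.py, after the defaults have been imported, the same idea
# would presumably read:
#     GLOBAL_PIPELINE = mainline_scrubber(GLOBAL_PIPELINE)
```

The archiver then sees the message before Scrubber touches it, while
everything from Tagger onward works on the scrubbed version.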

I think this will give us a foothold into providing a cleaner archive
with Pipermail, and into experimenting with Mailman-supported
de-mime-ification.  That's probably the best that'll happen for MM2.1.

Enjoy,
-Barry