[Mailman-Developers] Opening up a few can o' worms here...

Barry A. Warsaw barry@zope.com
Tue, 16 Jul 2002 21:34:16 -0400


>>>>> "CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:

    CVR> Actually, yes. I won't be working 65+ hours a week any more,
    CVR> so I sort of get my life back, and may actually have time to
    CVR> think stuff through and do more than emergency
    CVR> patching... (for more, see
    CVR> <http://www.chuqui.com/cgi-bin/mwf/topic_show.pl?tid=348>). Also
    CVR> means I can actually start some non-Apple hacking again, I
    CVR> hope. And what I'll be doing is lots of fun, although the
    CVR> next six weeks is going to be a crunch. Still doing email,
    CVR> just off building a new custom system for stuff I can't talk
    CVR> about...

Very interesting, and congrats on getting your life back.  Also, my
apologies for not responding earlier, but I think you understand
probably as well as anybody. :)

>   CVR> One thing we're definitely doing is moving to a cloaked
>   CVR> archive. Since we already distribute all archives out of

> So these are public archives that need to be scrubbed, right?  Until
> now, Mailman has taken the approach that public archives are fed
> right off the file system by the http server.  We could still do that
> if we scrubbed the messages before we archived them, although that
> doesn't help with existing archives unless you re-generate them.

    CVR> Here's why I won't do that. I want to keep ONE set of
    CVR> archives. You can't scrub those archives for two
    CVR> reasons. What if someone writes looking to get in contact
    CVR> with the author of a message? If the archive is scrubbed,
    CVR> that info is gone. And (god forbid), you get into a legal
    CVR> tangle? That's your legal record of what was said on the mail
    CVR> list and who said it. If you scrub it, and someone does
    CVR> something actionable or libelous and you get a court order to
    CVR> provide that data? You're hosed.

Excellent points all, I completely agree.

    CVR> On a more likely note -- I can see where you might want the
    CVR> option to show the archives unscrubbed to validated users,
    CVR> and only scrub the public archives. As paranoid as I'm being
    CVR> today, I'd STILL like to find a way to let subscribed users
    CVR> see the archives unscrubbed. Which you could do by setting a
    CVR> cookie that the CGI could accept and change its behavior.

Yup, all possible if we give up the notion of vending the public
archives from disk.  We pay in cpu, but oh well, that's cheap these
days, isn't it?

    CVR> So I really like leaving the archives unmodified, and doing
    CVR> the scrubbing via CGI. It also allows you to do other things,
    CVR> like header cleanups (and you could potentially let a user
    CVR> set a cookie to define minimal or full headers, say...) and
    CVR> some quickie cleanup against unwrapped text and some other
    CVR> incidental archive glitches.

    CVR> I come from a newspaper family, so I have a bias towards "you
    CVR> don't unpublish stuff, you don't change it once it's
    CVR> published". But I think there are good reasons to avoid
    CVR> sanitizing the archives, and instead sanitizing the delivery
    CVR> of those archives -- if only because if your policies change,
    CVR> all you need to change is the CGI. And it gives you the
    CVR> ability to set up different sets of abilities per user or per
    CVR> list if you want, too.

Again, excellent points.
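
For what it's worth, a minimal sketch of the scrub-on-delivery idea in
Python might look like the following.  The cookie name
`archive_session`, the `valid_tokens` set, and both function names are
assumptions made up for illustration, not anything Mailman ships, and
the address pattern is deliberately crude.

```python
import re
from http.cookies import SimpleCookie

# Crude address pattern, used only for this illustration.
ADDRESS_RE = re.compile(r"\b[^\s@<>]+@[^\s@<>]+\.[^\s@<>.]+\b")


def is_subscriber(cookie_header, valid_tokens):
    """True if the request carries a session token we issued (hypothetical scheme)."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("archive_session")
    return morsel is not None and morsel.value in valid_tokens


def render_message(stored_text, cookie_header, valid_tokens):
    """Serve the stored message untouched to subscribers, scrubbed to everyone else."""
    if is_subscriber(cookie_header, valid_tokens):
        return stored_text
    return ADDRESS_RE.sub("[email address deleted]", stored_text)
```

The stored archive is never modified here; only the text handed back to
the browser changes, so a policy change means changing one function.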

> So one question is: does the performance trade-off we made 5 years ago
> still make sense?  Should we just be vetting all archives through a
> cgi, in which we can do fun stuff like cleanse it of email addresses?

    CVR> One of the big things I dislike about Mhonarc is that
    CVR> archives are a rather low-usage system, but maintaining the
    CVR> Mhonarc index pages is rather intensive use of system
    CVR> resources. Sort of like usenet -- you do a lot of work on
    CVR> everything, in case someone wants anything. I think simply
    CVR> storing the archives and sanitizing on demand is lower
    CVR> overhead. It also means pipermail won't need ANY changes --
    CVR> you simply feed it out through the CGI instead of directly,
    CVR> and everything magically sanitizes...

Yup.  Wanna help write the script?

> We'd obviously have to get rid of the easy access to the raw mbox
> file, so another question is whether that's still useful.

    CVR> Honestly? I don't think so. I find them real kludgy. I ended
    CVR> up doing a new archiving system (one file per message) via a
    CVR> perl script. We're about to take our new search engine out of
    CVR> beta with the thing, finally.

I find the mboxes really handy for gathering statistics, but maybe
that's because Python has some really nice tools to troll through them
(e.g. we use the python-list mbox to stress ZODB).  And it's also
handy if you move lists, but I think that's about it.  I'm sure
"regular users" wouldn't care if we hid the mboxes.  BTW, that's all
true even if you go to a one-file-per-message layout a la mh.
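
As a rough illustration of the kind of statistics gathering I mean,
here is a sketch using the standard `mailbox` module as it exists in
current Python.  The function name and the choice of counting posts
per sender are just an example, not a script we actually run.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def posts_per_author(mbox_path):
    """Count messages per sender address in an mbox file."""
    counts = Counter()
    for message in mailbox.mbox(mbox_path):
        # parseaddr splits "Name <addr>" into (name, addr)
        _, address = parseaddr(str(message.get("From", "")))
        counts[address.lower()] += 1
    return counts
```

Something like `posts_per_author(path).most_common(10)` would then
list the ten most frequent posters.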

> Also, what heuristic do you use to search for email addresses, and
> what do you scrub them with?

    CVR> Still being worked on. Right now, I'm basically doing a
    CVR> <wordboundary><nonwhitespace>@<nonwhitespaceordot><dot><nonwhitespace><wordboundary>.
    CVR> I don't know how strongly we'll refine it.

Cool.

> Do you want to attempt to obscure the
> address (e.g. "barry--at--python--dot--org")

    CVR> Anything you programmatically obscure will be
    CVR> programmatically de-obscured.  This technique is bogus and
    CVR> guaranteed to fail as soon as the spammers care enough. It's
    CVR> pretty clear even the "randomized obscuring" of slashdot is a
    CVR> failed technique, since spambots don't have to decode ALL of
    CVR> those formats, just some of them, and then cycle through the
    CVR> site enough times....

    CVR> Sorry, I find this is a false security. Makes the users feel
    CVR> better, accomplishes nothing useful, so in reality, users get
    CVR> lazy and careless. So to some degree, I feel it's worse than
    CVR> nothing. I'm planning on replacing email addresses with
    CVR> something useful like [email address deleted].

Agreed.
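
A quick sketch of that heuristic plus the replacement, assuming a
straightforward reading of the pattern described above.  The regular
expression is my own loose translation, and `scrub` is a name made up
here.  It will miss some valid addresses and catch some things that
merely look like one.

```python
import re

# Word boundary, a run of non-whitespace, "@", a domain containing at
# least one dot, word boundary.  "<" and ">" are excluded so angle
# brackets around an address are left in place.
ADDRESS_RE = re.compile(r"\b[^\s@<>]+@[^\s@<>]+\.[^\s@<>.]+\b")
PLACEHOLDER = "[email address deleted]"


def scrub(text):
    """Replace anything that looks like an email address with a placeholder."""
    return ADDRESS_RE.sub(PLACEHOLDER, text)
```

For example, `scrub("Contact barry@zope.com for details")` gives
`"Contact [email address deleted] for details"`.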

>   CVR> disclosing that info. It creates other problems -- you can't
>   CVR> see a posting in the archive and send email to that person
>   CVR> with more questions (or answers), but that seems trivial
>   CVR> compared to the problems the spammers are causing.
> 
> It kind of plays into Reply-To: munging doesn't it?  If you won't be
> able to reply to the original author, because we're anonymizing
> messages, then you might as well munge Reply-To: to go back to the
> list because that's the only posting address that makes sense.

    CVR> Yes (he says, grimacing).

:)

    CVR> If you sanitize the archives, I don't think it affects the
    CVR> list. There are simply NO mailtos any more in the archives.

    CVR> If you go the step further and anonymize the postings ON the
    CVR> list, so subscriber email addresses simply are never shown to
    CVR> other subscribers under any circumstances (ugh. Urp. I can't
    CVR> believe I'm saying that. This is so anti-community it hurts),
    CVR> you have no choice and reply-to has to point to the list,
    CVR> since it's the only contact point left.

Yup.
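
The Reply-To: half of that is mechanically simple.  A sketch, assuming
the message has already been parsed with the standard `email` package
(the function name is hypothetical, and this is not how Mailman's
handlers are actually structured):

```python
from email.message import Message


def munge_reply_to(message: Message, list_address: str) -> Message:
    """Point Reply-To: at the list, discarding any value the poster set."""
    del message["Reply-To"]  # no error if the header is absent
    message["Reply-To"] = list_address
    return message
```

This leaves From: alone, so it only covers the reply path, not the
anonymizing itself.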

    CVR> If you instead turn the list server into a forwarding agent,
    CVR> as in:

> Or should Mailman get into the anonymous resender game?  There's
> probably a lot we could do here, but given the political risks of
> anonymous resenders, do we even want to go there?

    CVR> Is it an anonymous remailer? We're making no pretense of
    CVR> anonymity here.  We're acting as a forwarding agent, ala
    CVR> hotmail.com or mac.com. You mail to id13194@python.org, and
    CVR> it ends up in my mailbox. The fact that we're not explicitly
    CVR> denoting the real email address doesn't make us an anonymous
    CVR> remailer -- that'd be a policy issue, actually. I suppose you
    CVR> could take it that step further, but you could also set it up
    CVR> so validated subscribers could get to the real addresses.

    CVR> The model I'm thinking of is like many forum systems. If
    CVR> you're a guest, you don't get access to email info. If you're
    CVR> a subscriber, you log on, and they magically appear. In the
    CVR> case of mailing lists, since you lose control of the e-mail
    CVR> address once it leaves the site again, you handle this by
    CVR> only using the remailer address in mail that leaves the site,
    CVR> but a subscriber could go to the list system and look a user
    CVR> up. That gets us away from the politics of the anonymous
    CVR> stuff.

Hmm, maybe you're right.  You've got to keep those random forwarding
addresses alive for a long (configurable) time so that replies will
continue to work.
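
To make the bookkeeping concrete, here is a sketch of an alias table
with a configurable lifetime.  Everything in it (the class, the
`id<hex>@domain` alias format, keeping the table in memory) is an
assumption for illustration; a real version would need persistent
storage.

```python
import secrets
import time


class ForwardingTable:
    """Maps opaque list-side aliases to real addresses, with a lifetime."""

    def __init__(self, domain, lifetime_seconds):
        self.domain = domain
        self.lifetime = lifetime_seconds
        self._aliases = {}      # alias -> (real address, expiry timestamp)
        self._by_address = {}   # real address -> current alias

    def alias_for(self, real_address, now=None):
        """Return the live alias for an address, creating one if needed."""
        now = time.time() if now is None else now
        alias = self._by_address.get(real_address)
        if alias is None or self._aliases[alias][1] < now:
            alias = "id%s@%s" % (secrets.token_hex(4), self.domain)
            self._by_address[real_address] = alias
        # Each use pushes the expiry out again, so active posters keep
        # a stable alias.
        self._aliases[alias] = (real_address, now + self.lifetime)
        return alias

    def resolve(self, alias, now=None):
        """Return the real address, or None if the alias is unknown or expired."""
        now = time.time() if now is None else now
        entry = self._aliases.get(alias)
        if entry is None or entry[1] < now:
            return None
        return entry[0]
```

Outgoing mail would carry `alias_for(sender)`, and incoming mail to an
alias would be forwarded to whatever `resolve()` returns.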

    CVR> You have nailed it on the head. Which is why I brought it
    CVR> up. Not because this is the way it has to be in the future,
    CVR> but because all this is making Mailman's job a whole lot more
    CVR> complex (we were whining about that at work today, or at
    CVR> least I was and everyone was nodding sympathetically and
    CVR> looking for an open window -- email used to be pretty easy
    CVR> and straightforward. And now.....). But not just because all
    CVR> this crap is getting in the way, but also that fixing this
    CVR> crap is overkill for some environments, and going to be NOT
    CVR> ENOUGH in others.

Exactly.  Here's the trick: for those for whom it is not enough, get them
to pay enough for it that it could sustain a business.  That way, you
keep the overkill crowd happy with the free stuff, which the super
paranoid help subsidize.

    CVR> Damn, that sounds good, but -- I've had to give up crab and
    CVR> shellfish (I've developed an intermittent sensitivity to
    CVR> it. Sigh!) and I'm staying in cupertino where I'll be manning
    CVR> the war room this week making sure buttons get pushed when
    CVR> they need pushed, and not a minute before....

Ah too bad (about both!).  The offer of some cold ones (of a liquid of
your choice) stands if you ever make it to DC. :)

-Barry