[Mailman-Developers] Protecting email addresses from spam harvesters

Barry A. Warsaw barry@zope.com
Mon, 25 Feb 2002 12:27:23 -0500


I /think/ I've caught up on this thread, but I'm sure I've missed a
bunch.  As I see it there are really five places where email addresses
need protecting in Mailman:

1) list admin addresses
2) public archives
3) private archives
4) raw archive
5) list rosters

For #1, MM2.1 changes what gets included at the bottom of list pages.
The admin's personal address is no longer included in the link's text
or in the mailto: href.  In the mailto: you'll see something like
mylist-owner@dom.ain and in the text you'll see something like "barry
at zope.com".  I see no point in trying to obscure the former -- or
put it behind a web form -- because it's easily guessed given a probe
of existing lists, as is every other list-related email address.  More
on protecting the -owner from spam below.  I claim that the
guessability is a feature, btw.
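To make the footer change above concrete, here is a minimal sketch of
the idea: the mailto: target is the guessable -owner address, and the
visible text is the admin's address with the "@" spelled out.  The
function names (obscure_address, owner_footer) are made up for
illustration and are not Mailman's actual code.

```python
from html import escape


def obscure_address(addr):
    """Return 'local at domain' for display; leave odd input untouched."""
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        return addr
    return f"{local} at {domain}"


def owner_footer(listname, domain, admin_addr):
    """Build the footer link (hypothetical helper, not Mailman's API).

    The href uses the list's -owner address; the link text shows the
    admin's personal address in obscured form only.
    """
    owner = f"{listname}-owner@{domain}"
    text = obscure_address(admin_addr)
    return f'<a href="mailto:{escape(owner)}">{escape(text)}</a>'


print(owner_footer("mylist", "dom.ain", "barry@zope.com"))
```

Note that the personal address never appears in a machine-readable
form, only the -owner alias does.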

You can argue that "barry at zope.com" isn't obfuscated enough, and
you might be right.  I'm against any image or JavaScript approach to
protecting these because I really do want to keep Mailman's web
interface as pedestrian as possible.  In principle I don't mind if
JavaScript or images are used, but they should never be the only way
to navigate a Mailman site.  Mailman must degrade gracefully for
browsers that either don't support these features or have them
disabled.  I'd do the same with cookies if I could figure out how to
do low-frustration-factor authentication without them.

(Aside: I really really hate websites that are only viewable with
JavaScript on, and I often send a friendly ADA-ish noodge to webmaster
when I find such beasts, although it rarely does any good).

MM3 will likely integrate admin addresses and list memberships into an
object called a "roster" (essentially just a list of email addresses).
This will let us define a pipeline for each roster, which could
include a spam filter that performs an action based on some criteria
(e.g. drop it, reject it, mark a header, etc.).  So we can do more
protection on the -owner address than we can do now (without
hacking).  Rosters and the improved user database will allow us to
actually equate admin email addresses with Real Names, so you could
conceivably see something like

    List run by <a href="mailto:mylist-owner@dom.ain">Barry Warsaw</a>

at the bottom of the pages.  You'd be within your rights to argue that
end users never even need know who admins the list, but I think it
helps to avoid the "faceless droid" syndrome.
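The roster-with-a-pipeline idea above could look roughly like the
following.  This is only a sketch of one possible shape, not a design
for MM3: the Roster class, the "deliver" default, the dict standing in
for a message, and the subject_filter step are all invented here for
illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A pipeline step looks at a message and returns an action name
# (e.g. "discard", "reject", "mark") or None to let the message pass.
Step = Callable[[dict], Optional[str]]


@dataclass
class Roster:
    """Hypothetical roster: a list of addresses plus its own pipeline."""
    addresses: List[str]
    pipeline: List[Step] = field(default_factory=list)

    def process(self, msg):
        # The first step that returns an action wins; otherwise deliver.
        for step in self.pipeline:
            action = step(msg)
            if action is not None:
                return action
        return "deliver"


def subject_filter(msg):
    """Toy spam criterion, purely for illustration."""
    if "free money" in msg.get("subject", "").lower():
        return "discard"
    return None


owners = Roster(["barry@zope.com"], [subject_filter])
print(owners.process({"subject": "FREE MONEY inside"}))
print(owners.process({"subject": "please approve my post"}))
```

The point is just that the -owner roster and the membership roster
could each carry a different list of steps.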

Mailman should avoid getting deeply into the spam detection and
prevention business, except for some really really basic stuff
(probably not much more or less than it does now).  It should
integrate well with external spam detection programs like SpamAssassin
or commercial equivalents.  E.g. if we always send the message through
SA, and the message gets some score, we could decide to hold messages
below say 5.0 on the Spamster Scale, discard anything above 5.0, etc.
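The threshold idea above amounts to very little code on Mailman's
side, which is the point.  A minimal sketch, assuming the external
scanner has already handed us a numeric score: the function name, the
action strings, and the lower accept_below cutoff are assumptions made
for this example (the text only names the 5.0 figure).

```python
def spam_action(score, accept_below=0.0, discard_above=5.0):
    """Map an externally computed spam score to an action.

    accept_below is an assumed lower cutoff so that clean mail is not
    held; discard_above is the 5.0 figure mentioned in the text.
    """
    if score > discard_above:
        return "discard"
    if score > accept_below:
        return "hold"
    return "accept"


for score in (-1.2, 3.1, 7.4):
    print(score, spam_action(score))
```

All the actual detection stays in the external program; Mailman only
decides what to do with the number.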

As for #2, I'd go for the low-tech approach of simply discarding the
hostname part of the email address in all public archives.  Certainly
this is easy in the headers, and we'll have to decide whether we're
going to expend the resources to do body searches for email addresses,
and obfuscate those as well.  If people want to make contacts based on
some public archive message, they can email the list.  Until we've got
web-posting, I don't think it matters if they lose the full email
address in the public archives.
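Here is a rough sketch of the low-tech approach above, covering both
the easy header case and the more expensive body search.  The function
names and the regular expression are assumptions for illustration; the
pattern is deliberately simple and will not catch every legal address
form.

```python
import re
from email.utils import parseaddr

# Simple pattern for addresses in free text: keep the local part
# (group 1) and match, but do not keep, the "@host.name" part.
ADDR_RE = re.compile(
    r"\b([A-Za-z0-9._%+-]+)@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
)


def strip_host_header(value):
    """Drop the hostname from a single-address header such as From:."""
    name, addr = parseaddr(value)
    local = addr.partition("@")[0]
    if name and local:
        return f"{name} <{local}>"
    return local or name or value


def strip_host_body(text):
    """Replace every address found in body text with its local part."""
    return ADDR_RE.sub(r"\1", text)


print(strip_host_header("Barry Warsaw <barry@zope.com>"))
print(strip_host_body("You can write to barry@zope.com."))
```

The header case is one parse per message; the body case is a regex
pass over every message, which is where the resource question comes
in.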

As for #3, I don't mind not obscuring the email addresses since a
login will be required.  If we think we don't trust the current
private archive login procedures to be secure against bots, then we
can fix that, but I don't see it as a high priority.

#4 is interesting too.  I'm not against putting the raw archive behind
a Turing test, since I suspect that very few people will ever want
it.  It means that we won't be able to write an automated wget-ish
script to do off-site backups, but so be it.

Things to note for #'s 2-4:

- The Pipermail implementation has lots of well-known problems.  I'm
  personally not willing to spend a lot of time fixing them, and I
  still recommend Real Sites use a Real Archiver.  I've just thrown
  the majority of the email obfuscation problems over the fence into
  someone else's back yard <wink>.

- Adding public archive obfuscation is fine and dandy for new messages
  added to the archives but what about all the existing archived
  messages?  Re-running Pipermail (i.e. bin/arch) to regenerate from
  scratch has two significant drawbacks: 1) message URLs can change,
  especially if you also fix broken From_ delimiters, and that in turn
  breaks bookmarks; 2) on large mboxes, you simply can't run bin/arch
  because of memory problems.

- Someone needs to step up and "own" Pipermail if any of these
  problems are going to be fixed, or if the obfuscation is going to
  happen.

- Remember that Pipermail itself is completely optional.  An API is
  defined between Mailman and the archiver and that's all the
  interaction they have.  Maybe the API needs to be more elaborate to
  support obfuscation.  It definitely needs some changes if we ever
  want to add some of the features I'd like to add (but that's
  off-topic here).

- I'll note that one of the early design decisions for Pipermail was
  that public archives should be vended directly from the file system
  for performance reasons.  That decision may not be appropriate for
  today's operations.  Certainly maintaining two static versions of
  the pages isn't feasible, so I think you have to vend one or the
  other (probably the obfuscated version) from a CGI.

- Anybody who's really interested in archiving (I'm not) should take a
  look at Zest and think about contributing to it, so that it does
  stuff like this correctly from the start.  I really like the Zest
  interface and think it ought to at least be a bundled alternative to
  Pipermail, some day.

Nobody's even mentioned #5, the list rosters, which are available
publicly via the "Visit Subscriber List" button, or the email command
"who" sent to the -request address.  If I were a spam harvester, I
wouldn't even bother with scanning the archives if either of these
were publicly enabled.  When you turn them off, especially the former,
just remember that you've now made it much harder for Joe User to
unsubscribe themselves.  Catch-22.

-Barry