From terri at zone12.com  Tue Jul  3 05:06:31 2007
From: terri at zone12.com (Terri Oda)
Date: Mon, 2 Jul 2007 23:06:31 -0400
Subject: [Mailman-Developers] Improving the archives
Message-ID: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>

Since I've largely finished up the coding contract that was eating up  
a lot of my time, I'm thinking that I'd like to do some coding for  
fun.  And nothing says fun like trying to fix the Mailman archives! ;)

I'm trying to remember all the things people have suggested for the  
archives in the past so I can figure out what needs to be done and  
what might be nice to have, and see if this is doable in the time I  
have in the foreseeable future.

The big things people wanted most, if I recall correctly, included:

- modernized HTML/CSS/Themes (preferably to match a modernized web  
interface... is that all set up now?)
- archive links that won't break if the archive is rebuilt
- better address obfuscation (maybe by generating pages through cgi)
- search
- not adding a billion dependencies to Mailman

Here's the list from the wiki's Mailman 2.2 page:
http://wiki.list.org/display/DEV/Mailman+2.2

     *  Reconsider using a 3rd-party archiver
     * Perhaps URLs to messages should be based on message-ids  
instead of message numbers so that regenerating archives can't break  
links. This must include backward compatible links
     * Ditch direct access and vend all archive messages through CGI  
so that we can do address obfuscation, and message deletion, etc. on  
the fly (with caching of course, but have to worry about web crawlers).
     * Add RSS feed
     * Allow for admins to remove or edit messages through the web.
     * Move archive threads into another list?
     * Put archives in the list/mylist directory.
     * Add a search option
     * Make archives default template look and feel similar to Web UI  
(whatever it looks like after the Summer of Code project is done)
     * Make archive templatable (at least by changing CSS) so they  
can match people's existing site look-and-feel
     * MUAs usually make URLs clickable. A new Archive could be used  
when posts are distributed, in the footer, so that each message has a  
link to the whole thread in the Archive.
     * Present all messages in a thread at once, and offer plaintext  
download of the whole thread
     * Put messages into a database and/or move away from mbox as the  
canonical storage format.

So the questions are:

(1) Is anyone working on this already?
(2) What else is on people's wish lists for a pipermail replacement?

  Terri


From huston at astro.princeton.edu  Tue Jul  3 13:36:23 2007
From: huston at astro.princeton.edu (Steve Huston)
Date: Tue, 03 Jul 2007 07:36:23 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
Message-ID: <468A34B7.3080201@astro.princeton.edu>

I'll admit to not having read previous discussions on this topic, but
I'll also add my 2 <insert-lowest-denomination-coin> here:

On 7/2/07 11:06 PM, Terri Oda wrote:
> - better address obfuscation (maybe by generating pages through cgi)

I run a few Wordpress sites, and there's a plugin I use called
PHPEnkoder which does a good job of this.  It basically wraps the
address around a little bit of Javascript; if you have Javascript turned
on in the browser, it's seamless, and if not you see "Javascript
required to view address" or something like that.  The theory is that
bots and such don't run JS, so it's "safe" from harvesting.  I'll leave
it to the list as to how true an assessment this is, but it Works For Me :>
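
A minimal sketch of the general technique in Python, to make the idea
concrete.  This is not PHPEnkoder's actual algorithm; the function name
obfuscate_address is hypothetical, and it assumes the harvester does not
execute Javascript:

```python
import json


def obfuscate_address(address):
    """Return an HTML fragment that shows `address` only if Javascript runs."""
    # Emit the address as a list of character codes so the literal
    # string never appears in the page source.
    codes = [ord(ch) for ch in address]
    script = (
        "<script>document.write("
        "String.fromCharCode.apply(null, %s));</script>" % json.dumps(codes)
    )
    # Browsers without Javascript see only this fallback text.
    fallback = "<noscript>Javascript required to view address</noscript>"
    return script + fallback
```

An archiver could call something like this wherever it would otherwise
print a raw address.  A bot that does run Javascript would still see the
address, so this raises the cost of harvesting rather than preventing it.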

>      * Add a search option

I know there have been patches around forever that integrate ht://Dig with
Pipermail; maybe some way to do this, while still making it an option
that can be tuned?  If ht://Dig is there and you turn on the option, it
works, but if it's not then it's not required?  This would satisfy the
"not adding a billion dependencies", but may be overkill as well.  I'll
also happily admit to not knowing much about the cost of search engines
to a system.

>      * MUAs usually make URLs clickable. An new Archive could be used  
> when posts are distributed, in the footer, so that each message has a  
> link to the whole thread in the Archive.

This would be a Godsend.  A group at work here runs an old homebrewed
exploder, and a few years ago I tried to convert them to Mailman.  They
liked everything they saw, up until the point where they couldn't refer
to some kind of short and simple message number, and get right to that
message in the archive.  The current system generates a number based on
a simple incrementing index of the list, and many months after a mailing
people will refer to "message #483", and know they can view it at
http://hostname/foo/listname/483.html - which is also posted in the
footer of the message sent out.  Of course, if the archives were based
on Message-ID headers, this may make such a number a bit unwieldy, but
if it were some kind of simple-ish system I might finally get rid of
those old lists :>

-- 
Steve Huston - W2SRH - Unix Sysadmin, Dept. of Astrophysical Sciences
  Princeton University  |    ICBM Address: 40.346525   -74.651285
    126 Peyton Hall     |"On my ship, the Rocinante, wheeling through
  Princeton, NJ   08544 | the galaxies; headed for the heart of Cygnus,
    (609) 258-7375      | headlong into mystery."  -Rush, 'Cygnus X-1'

From barry at python.org  Wed Jul  4 02:05:12 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 3 Jul 2007 20:05:12 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
Message-ID: <849198AE-DEC3-44C8-A090-470720624185@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:

> Since I've largely finished up the coding contract that was eating up
> a lot of my time, I'm thinking that I'd like to do some coding for
> fun.  And nothing says fun like trying to fix the Mailman archives! ;)

That would be awesome Terri!  It's an aspect of Mailman that sorely  
needs attention, and you will gain (even more) fame and fortune by  
working on it. :)  I totally support this effort.

> I'm trying to remember all the things people have suggested for the
> archives in the past so I can figure out what needs to be done and
> what might be nice to have, and see if this is doable in the time I
> have in the foreseeable future.
>
> The big things people wanted most, if I recall correctly, included:
>
> - modernized HTML/CSS/Themes (preferably to match a modernized web
> interface... is that all set up now?)

It's not, but Andrew Kuchling will be working on this.  I haven't yet  
revealed detailed plans, though I'm working on an email about this  
over the U.S. July 4th holiday.  But I suppose it's time for a quick  
summary: I'd like to get a Mailman 2.2 out with an updated u/i sooner  
rather than later, and if possible an updated archiver would be one  
of those few other new features that I think could go into a 2.2.   
OTOH, it would be fine if we pushed that off to Mailman 3 too, but it  
leveraged all the u/i work to be done in 2.2.

> - archive links that won't break if the archive is rebuilt

Yes, this is absolutely critical, in fact, I'd put it right at the  
top of the list, even more so than a u/i overhaul.  Stable urls, with  
backward compatible redirecting links if at all possible, would be  
fantastic.

Along with that, I would really like to come up with an algorithm for  
calculating those urls without talking to the archiver.  This would  
allow the list delivery queue to calculate the List-Archive: header  
value and any message header/footer substitutions before the message  
hits the archiver.
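
One way such an algorithm could look, as a rough Python sketch: hash the
Message-ID and use the encoded digest as the last URL component, so that
both the delivery queue and the archiver can compute it independently.
The base URL, the function name archive_url, and the choice of
SHA-1 plus base32 are all assumptions made for illustration, not a
settled design:

```python
import base64
import hashlib

ARCHIVE_BASE = "http://archives.example.com"  # hypothetical base URL


def archive_url(list_name, message_id):
    """Compute an archive URL from the Message-ID alone."""
    # Normalize so "<id>" and "id" produce the same URL.
    msgid = message_id.strip().strip("<>")
    digest = hashlib.sha1(msgid.encode("utf-8")).digest()
    # A 20-byte SHA-1 digest encodes to exactly 32 base32 characters.
    token = base64.b32encode(digest).decode("ascii")
    return "%s/%s/%s" % (ARCHIVE_BASE, list_name, token)
```

Because nothing here depends on archiver state, the same value could be
placed in the List-Archive: header or a footer before the message is
archived.  Two messages that share a Message-ID would collide, so
duplicates would still need separate handling.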

> - better address obfuscation (maybe by generating pages through cgi)

I'd still love to do this, and I think were it not for crawlers, we  
could get a lot of mileage out of creation on demand and caching.   
But how do you handle Google crawling your archive?

> - search

Another huge huge feature.

> - not adding a billion dependencies to Mailman

Definitely.  I'm also not opposed to changing the interface between  
Mailman and the archivers if necessary.

> Here's the list from the wiki's Mailman 2.2 page: http://
> wiki.list.org/display/DEV/Mailman+2.2

We should probably start a separate archiver wiki page.  I plan on re- 
organizing the 2.2 page anyway, so I'll probably end up doing that if  
you don't get around to it before me <wink>.

> (1) Is anyone working on this already?

Not that I know of.

> (2) What else is on people's wish lists for a pipermail replacement?

Other things high on my list are ditching the crufty storage  
currently being used (pickles begone!), an RSS feed, and a 'message  
storage' which could be used to vend archived messages through other  
delivery transports, such as imap or nntp.  But I'd be willing to put  
all that off for stable urls, an updated u/i, and searching.

Anything I can do to help, please let me know.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRorkOHEjvBPtnXfVAQLw0wP/TFgXxFAcK+3QiDG4jkyPCVVpP0EqATwB
nYfUDrf0ytuTphFMM4gJmWbZdtR1HJ2xqNOit18QTsM/pjTiIDB++nH0IoRkRwy3
qs4JdBb+m3Amuxaaa4dQp+nWQt2yUMsF/HWp3BS/vx8oCfkjMhOKDI29/UG9jU+L
L64QzWeywGw=
=ewlo
-----END PGP SIGNATURE-----

From barry at python.org  Wed Jul  4 02:13:56 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 3 Jul 2007 20:13:56 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <468A34B7.3080201@astro.princeton.edu>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<468A34B7.3080201@astro.princeton.edu>
Message-ID: <13B9C232-8295-4533-B49B-205B901AA8E7@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Steve makes me think of a couple of other wish list items.

On Jul 3, 2007, at 7:36 AM, Steve Huston wrote:

> On 7/2/07 11:06 PM, Terri Oda wrote:
>> - better address obfuscation (maybe by generating pages through cgi)
>
> I run a few Wordpress sites, and there's a plugin I use called
> PHPEnkoder which does a good job of this.

I have this idea that you could gateway messages from an archive or  
mailing list to and from a bulletin board forum.  Maybe this doesn't  
fall within the scope of the archiver because I could see a 'forum  
queue' like we have an nntp queue, but in that case, being able to  
calculate an archive url without talking to the archiver becomes  
important again.  It would be nice in that case to put a link to the  
archive message in the forum post.

>>      * MUAs usually make URLs clickable. An new Archive could be used
>> when posts are distributed, in the footer, so that each message has a
>> link to the whole thread in the Archive.
>
> This would be a Godsend.  A group at work here runs an old homebrewed
> exploder, and a few years ago I tried to convert them to Mailman.   
> They
> liked everything they saw, up until the point where they couldn't  
> refer
> to some kind of short and simple message number, and get right to that
> message in the archive.

This reminds me, I would love to have a link in an archive message  
that I could click to get the message sent to me, as it originally  
appeared on the mailing list.  If I had that, I'd never need to  
locally save another mailing list post.  I'd just search for the one  
I wanted, go to the archive, click on the "send it to me" link, then  
do a normal reply in my mail reader.

> The current system generates a number based on
> a simple incrementing index of the list, and many months after a  
> mailing
> people will refer to "message #483", and know they can view it at
> http://hostname/foo/listname/483.html - which is also posted in the
> footer of the message sent out.  Of course, if the archives were based
> on Message-ID headers, this may make such a number a bit unwieldly,  
> but
> if it were some kind of simple-ish system I might finally get rid of
> those old lists :>

This would be possible with today's system, but it leads to unstable  
urls, especially when you consider archive scrubbing (which, come to  
think of it, is another wish list item ;).  We'd like for an admin to  
be able to easily pull an archive message, but it's even worse than  
that.  Sometimes an admin has to scrub the actual backing message  
store (e.g. today's mbox file).  This will change the message counts  
and thus the incremental indexes.

Maybe a way to think about this is that the canonical url is based on  
the message-id, but then there's some way to distill even this down  
to a tinyurl or simple integer that would be stable in the face of  
full archive regenerations.
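
A small Python sketch of the "simple integer" variant, assuming the
mapping from message-id to number is kept separately from the message
store and is never renumbered.  The class ShortIdRegistry is
hypothetical, and a real version would have to persist the mapping
rather than hold it in memory:

```python
class ShortIdRegistry:
    """Assign each message-id a short integer exactly once."""

    def __init__(self):
        self._by_msgid = {}   # message-id -> short integer
        self._by_number = {}  # short integer -> message-id (None if scrubbed)
        self._next = 1

    def assign(self, message_id):
        # Re-adding a known message-id (e.g. during a full archive
        # regeneration) returns the number it already had.
        if message_id not in self._by_msgid:
            self._by_msgid[message_id] = self._next
            self._by_number[self._next] = message_id
            self._next += 1
        return self._by_msgid[message_id]

    def scrub(self, message_id):
        # Keep the slot reserved so later messages are not renumbered.
        number = self._by_msgid.get(message_id)
        if number is not None:
            self._by_number[number] = None

    def lookup(self, number):
        return self._by_number.get(number)
```

The point of the sketch is only that scrubbing a message leaves a hole
instead of shifting every later number, which is what breaks links when
numbers are derived from position in the mbox.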

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRormRHEjvBPtnXfVAQIHYwP/fLnY/pebRlhrFeUpPJu5VfZNyR24oLId
qjZ4F2MHW25LcemvGzpeUSgXRQJk2LQIQKSlYYtTM+8xcStey4IvDnPLmzX5MQOC
xiI9PznZHdLmbF9SaUDZQZBRKZhqCNeslZ5zpnN35KStL3NlTc6PkBylzIC7Y47F
a3RxMEOgMaA=
=HM9I
-----END PGP SIGNATURE-----

From stephen at xemacs.org  Wed Jul  4 09:49:58 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 04 Jul 2007 16:49:58 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <849198AE-DEC3-44C8-A090-470720624185@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
Message-ID: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:
 > > - archive links that won't break if the archive is rebuilt
 > 
 > Yes, this is absolutely critical, in fact, I'd put it right at the  
 > top of the list, even more so than a u/i overhaul.  Stable urls, with  
 > backward compatible redirecting links if at all possible, would be  
 > fantastic.

+1.  I've been wanting to do something about this, and have made
proposals (not backed with code, mea maxima culpa) for design.  I would
definitely be happy to help with this, but given time constraints, it
would be nice if somebody else could take the lead.

 > Along with that, I would really like to come up with an algorithm for  
 > calculating those urls without talking to the archiver.

Brad didn't like this when I suggested it before, but I didn't really
understand why not.  Anyway, FWIW:

I suggest adding an X-List-Received-ID header to all messages.  I
haven't really thought through whether the UUID in that field should
be at least partly human-readable or not, but that doesn't matter for
the basic idea.[1]  The on-disk directory format would be

/path-to-archive/private/my-list/Message-ID

for singletons (Message-ID is the author-supplied ID) and

/path-to-archive/private/my-list/Message-ID/List-Received-ID

for multiples.  These would be created on-the-fly when they occur.
They can be served as static pages.  For almost all messages, the bare
URL

http://archives.example.com/my-list/Message-ID

should Just Work (ie, return a no-such-object result or a single
message).  Where it does not, you get an index of all pages with that
message ID.
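
The lookup rule above can be sketched in Python as follows.  This is an
in-memory model only, assuming a hypothetical ArchiveIndex class; it
stands in for the on-disk directories and is meant to show the three
possible results of resolving a bare Message-ID URL:

```python
from collections import defaultdict


class ArchiveIndex:
    """Model of Message-ID paths with optional List-Received-ID children."""

    def __init__(self):
        # Message-ID -> {List-Received-ID: message text}
        self._by_msgid = defaultdict(dict)

    def add(self, message_id, received_id, text):
        self._by_msgid[message_id][received_id] = text

    def resolve(self, message_id, received_id=None):
        entries = self._by_msgid.get(message_id)
        if not entries:
            return ("not-found", None)
        if received_id is not None:
            # Full path: Message-ID/List-Received-ID
            if received_id in entries:
                return ("message", entries[received_id])
            return ("not-found", None)
        if len(entries) == 1:
            # Singleton: the bare URL returns the message itself.
            return ("message", next(iter(entries.values())))
        # Multiples: the bare URL returns an index of received IDs.
        return ("index", sorted(entries))
```

If the received ID starts with the date, the sorted index is in
chronological order without any extra work.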

The main drawback to using Message IDs that I can see is that broken
MUAs may supply no Message-ID, or the same one repeatedly.  In the
former case, as a last resort Mailman can supply one, but that won't
help people who get a personal copy and want to find the thread.
However, I see no way to help them, anyway, beyond a generic archive
search engine.  In the latter, you get lots of messages matching the
Message-ID, and while most lists should have *zero* problems, a list
that has any instances of this problem would have many.  Again I can't
see a good way to deal with this other than a general search facility,
as computing a digest of headers or content is hard to do reliably.
Providing an index of matching posts seems like a reasonable approach,
which can be efficiently implemented (eg, as static pages).
Furthermore, the examples I've seen of both in the last few years have
all been either spam or (in the case of duplicate Message-IDs) actual
duplicates due to some mail system problem or itchy user fingers.

A minor drawback to my proposal is that if a message gets archived as
a singleton for that Message-ID, then a duplicate arrives, previously
created references in the archive will of course now return an index
rather than the desired message.  Ie, there is data corruption.  This
can be dealt with in several ways; the easiest would be to provide a
"if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me"
link when creating the directory for multiple instances.

There's also a *very* minor benefit: repeat sends will be immediately
recognizable without checking Message-ID.

Footnotes: 
[1]  By partly human-readable I mean containing list-id and date
information.  The idea would be to have the date come first, so that
users would have a shot at identifying which of several messages is
most likely, and this would be searchable by eye with simply an
ordinary sorted index.


From jam at jamux.com  Wed Jul  4 18:58:20 2007
From: jam at jamux.com (John A. Martin)
Date: Wed, 04 Jul 2007 12:58:20 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's
	message of "Wed, 04 Jul 2007 16:49:58 +0900")
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <87k5tgxg0j.fsf@athene.jamux.com>

A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 154 bytes
Desc: not available
Url : http://mail.python.org/pipermail/mailman-developers/attachments/20070704/b9919498/attachment.pgp 

From Dale at Newfield.org  Wed Jul  4 19:16:58 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Wed, 04 Jul 2007 13:16:58 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>	<468A34B7.3080201@astro.princeton.edu>
	<13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
Message-ID: <468BD60A.2020709@Newfield.org>

I'm all for someone taking ownership of this long-neglected component -- 
thank you for doing so!

Barry Warsaw wrote:
> Maybe a way to think about this is that the canonical url is based on  
> the message-id, but then there's some way to distill even this down  
> to a tinyurl or simple integer that would be stable in the face of  
> full archive regenerations.

The resistance to basing this on message-id has always been that there's 
no guarantee of uniqueness...
...but I believe each list has some sort of counter for how many 
messages it's seen, so we could add another header with that number, and 
use as a unique id the two concatenated together...
(That way the archiver can know from the content of the header exactly 
how to generate the same unique id as mailman, which would allow for the 
url-in-the-footer to happen w/o first hitting the archiver.)
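
A quick Python sketch of that idea, assuming a hypothetical header name
(X-List-Sequence) and hypothetical helper names; the only real API used
is the standard library email package:

```python
from email import message_from_string

COUNTER_HEADER = "X-List-Sequence"  # hypothetical header name


def stamp(raw_message, post_number):
    """Mailman side: record the list's post counter in the message."""
    msg = message_from_string(raw_message)
    msg[COUNTER_HEADER] = str(post_number)
    return msg


def unique_id(msg):
    """Either side: derive the id from headers carried in the message."""
    msgid = msg["Message-ID"].strip().strip("<>")
    return "%s-%s" % (msgid, msg[COUNTER_HEADER].strip())
```

Since both values travel inside the message, the archiver can compute
the same id from the delivered copy that Mailman computed when it built
the footer.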

Just throwing out ideas,
-Dale

From jeff at jab.org  Wed Jul  4 21:30:04 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 4 Jul 2007 12:30:04 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<468A34B7.3080201@astro.princeton.edu>
	<13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
Message-ID: <e03b90ae0707041230m47110705t89cdbe3d2e4802cd@mail.gmail.com>

>Maybe a way to think about this is that the canonical url is based on
>the message-id, but then there's some way to distill even this down
>to a tinyurl or simple integer that would be stable in the face of
>full archive regenerations.

I'd suggest the reverse. Keep the canonical archive URL short and
sweet, and then use a URL redirection service to map message-id's
to those URLs. It is the archiver's job to make it all work. For example,
the canonical archive URL might stay exactly the way it is in pipermail.
But the archival link embedded in the message would instead go
to a redirection service.

http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
http://mail.codeit.com/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

The one other thing I'd like to revisit is integration with third party
archival services. There are two obvious integration points; one is a
button in the Mailman list admin user interface that says "archive with
service X" not unlike the setting in Firefox that basically says "search
with service X". The other integration point is the archival link
discussed above, in which case it would be set to something like:

http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

Disclosure: I help run a third party archiving service, and this topic was
discussed quite a bit previously.  [1] Nonetheless it seems like a good
time to revisit given the current discussion about archive wishlists.

[1] http://www.mail-archive.com/mailman-developers at python.org/msg08772.html

From jeff at jab.org  Thu Jul  5 06:48:30 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 4 Jul 2007 21:48:30 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707041230m47110705t89cdbe3d2e4802cd@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<468A34B7.3080201@astro.princeton.edu>
	<13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
	<e03b90ae0707041230m47110705t89cdbe3d2e4802cd@mail.gmail.com>
Message-ID: <e03b90ae0707042148s1b81551dxc7d088fa1b38fb59@mail.gmail.com>

>In which case [the message body link] would be set to something like.
>
>http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

Just for fun, I did a trial implementation. It works, but the URLs are
too long. For example, the URL below spends 59 characters on the
message-id, and 27 characters on the listname. We're already over my
comfort level (of about 72 characters) and haven't even started to count
the hostname, and other URL-lengthening overhead. Maybe this was a bad
idea after all.

http://www.mail-archive.com/search?l=mailman-developers%40python.org&q=e03b90ae0707041230m47110705t89cdbe3d2e4802cd at mail.gmail.com

Jeff

From jdennis at redhat.com  Thu Jul  5 18:09:29 2007
From: jdennis at redhat.com (John Dennis)
Date: Thu, 05 Jul 2007 12:09:29 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <849198AE-DEC3-44C8-A090-470720624185@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
Message-ID: <1183651769.10813.6.camel@finch.boston.redhat.com>

On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
> 
> > Since I've largely finished up the coding contract that was eating up
> > a lot of my time, I'm thinking that I'd like to do some coding for
> > fun.  And nothing says fun like trying to fix the Mailman archives! ;)
> 
> That would be awesome Terri!  It's an aspect of Mailman that sorely  
> needs attention, and you will gain (even more) fame and fortune by  
> working on it. :)  I totally support this effort.

A little over a year ago I went on a search to find the best open source
archiver and at that time I came up with Lurker
(http://lurker.sourceforge.net). Since then I believe Lurker has seen a
major new revision. I also believe Lurker is the archiver used by
Debian.

So if you want to leverage existing open source archiving or at least
look at an example of what would be necessary to allow easy
external archiving integration with Mailman you might want to look at
Lurker.
-- 
John Dennis <jdennis at redhat.com>


From terri at zone12.com  Thu Jul  5 19:02:37 2007
From: terri at zone12.com (Terri Oda)
Date: Thu, 5 Jul 2007 13:02:37 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <1183651769.10813.6.camel@finch.boston.redhat.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
Message-ID: <C2CF9099-599E-4985-9B0D-37871E14219F@zone12.com>

On 5-Jul-07, at 12:09 PM, John Dennis wrote:
> A little over a year ago I went on a search to find the best open  
> source
> archiver and at that time I came up with Lurker
> (http://lurker.sourceforge.net) Since then I believe Lurker has seen a
> major new revision. I also believe Lurker is the archiver used by
> Debian.

I was hoping someone would post that link!  Lurker was best of breed  
last time I was looking, and I'd definitely like to see what we can  
leverage there.

  Terri


From barry at python.org  Sat Jul  7 18:35:30 2007
From: barry at python.org (Barry Warsaw)
Date: Sat, 7 Jul 2007 12:35:30 -0400
Subject: [Mailman-Developers] Mailman roadmap
Message-ID: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Now that we've successfully navigated the switch to Bazaar, it's time  
to lay out plans for future Mailman releases.  I've talked to several  
people about what to do about Mailman's future and I'd like to take  
this opportunity to describe my thoughts and get your feedback.   
First some background.

Mailman 2.1 is (shockingly) four and a half years old, having been  
initially released on 30-Dec-2002.  The last release in the series,  
2.1.9 was made almost a year ago.  In the meantime, Mark and Tokio  
have been doing a great job maintaining the 2.1 branch, with several  
important patches in the tree now that will eventually become  
2.1.10.  The problem of course is that we can't add any new features  
to the 2.1 family <wink>, so we should be thinking about a new major  
release.

I've been making good progress on the SQLAlchemy/Elixir version, which  
will finally get rid of pickles and put Mailman on a Real Database  
(tm).  It's been clear to me for a while that this branch will have a  
unified user database.  It simply makes no sense to build the  
database back-end without once and for all fixing this design  
constraint.  I've always said that the unified user database will be  
in Mailman 3, and thus this branch is indeed called "Mailman 3.0".

I've been slowly building things back up from the ground floor.  The  
basic data model is in pretty good shape and I'm taking a religious  
test-driven approach to making things work again.  But the branch  
still needs a lot of work, and I have no ETA for Mailman 3.0.

In the meantime, Andrew Kuchling and others have volunteered to work  
on modernizing the Mailman web u/i, and Terri recently started a  
thread discussing updates to the archiver.  I think it makes sense to  
bless these efforts, towards the goal of releasing them in Mailman  
2.2.  I intend to create an official Mailman 2.2 branch in bzr where  
these efforts can land as they mature.  My hope of course is that  
we'll also be able to use much of this new code for Mailman 3.

I'd like to keep the changes for 2.2 focused on the web u/i and  
archiver, with a small number of additional features to be  
determined.  Mailman 2.2 should see no changes to the basic  
architecture or 'database'; we'll continue to use pickles by default  
for Mailman 2.2.  While I won't rule out other new features, I want  
to be very picky about those that are accepted for 2.2, and would not  
feel bad at all if we rejected or deferred until 3.0 most of those  
proposed.  Criteria for other 2.2 features must include minimal code  
impact with a high degree of reliability and stability.

I plan on updating the wiki pages to reflect this thinking, but I  
would like to get feedback from y'all about the plan.  It would be  
awesome if we could see a release of Mailman 2.2 some time in late  
2007 or early 2008.

Comments, questions?

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRo/A0nEjvBPtnXfVAQJ+dwP7BXXLM749qO6CXWQKZw41pFN42jYfN6Kg
LjNAQ9IejAT/TISGrSgk8UyZ9kP6ajnOFvKIfJNTFJdytJg8/lvDQSeW1N0u7sR+
Wp0N1e0qA4qfqLYsqRR9W1MQhecdBO/yEJo8KDsOQdGnpfINSKZ40FUvPEbC40U7
C/T83gS+Vxs=
=JJZS
-----END PGP SIGNATURE-----

From barry at python.org  Sat Jul  7 18:36:54 2007
From: barry at python.org (Barry Warsaw)
Date: Sat, 7 Jul 2007 12:36:54 -0400
Subject: [Mailman-Developers] Mailman roadmap
Message-ID: <A2753DAD-DEFD-4966-B458-E21CCB93DFCE@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sorry, I forgot to cross-post this to mailman-users, so I'm reposting.
- -Barry

Now that we've successfully navigated the switch to Bazaar, it's time  
to lay out plans for future Mailman releases.  I've talked to several  
people about what to do about Mailman's future and I'd like to take  
this opportunity to describe my thoughts and get your feedback.   
First some background.

Mailman 2.1 is (shockingly) four and a half years old, having been  
initially released on 30-Dec-2002.  The last release in the series,  
2.1.9 was made almost a year ago.  In the meantime, Mark and Tokio  
have been doing a great job maintaining the 2.1 branch, with several  
important patches in the tree now that will eventually become  
2.1.10.  The problem of course is that we can't add any new features  
to the 2.1 family <wink>, so we should be thinking about a new major  
release.

I've been making good progress on the SQLAlchemy/Elixir version, which  
will finally get rid of pickles and put Mailman on a Real Database  
(tm).  It's been clear to me for a while that this branch will have a  
unified user database.  It simply makes no sense to build the  
database back-end without once and for all fixing this design  
constraint.  I've always said that the unified user database will be  
in Mailman 3, and thus this branch is indeed called "Mailman 3.0".

I've been slowly building things back up from the ground floor.  The  
basic data model is in pretty good shape and I'm taking a religious  
test-driven approach to making things work again.  But the branch  
still needs a lot of work, and I have no ETA for Mailman 3.0.

In the meantime, Andrew Kuchling and others have volunteered to work  
on modernizing the Mailman web u/i, and Terri recently started a  
thread discussing updates to the archiver.  I think it makes sense to  
bless these efforts, towards the goal of releasing them in Mailman  
2.2.  I intend to create an official Mailman 2.2 branch in bzr where  
these efforts can land as they mature.  My hope of course is that  
we'll also be able to use much of this new code for Mailman 3.

I'd like to keep the changes for 2.2 focused on the web u/i and  
archiver, with a small number of additional features to be  
determined.  Mailman 2.2 should see no changes to the basic  
architecture or 'database'; we'll continue to use pickles by default  
for Mailman 2.2.  While I won't rule out other new features, I want  
to be very picky about those that are accepted for 2.2, and would not  
feel bad at all if we rejected or deferred until 3.0 most of those  
proposed.  Criteria for other 2.2 features must include minimal code  
impact with a high degree of reliability and stability.

I plan on updating the wiki pages to reflect this thinking, but I  
would like to get feedback from y'all about the plan.  It would be  
awesome if we could see a release of Mailman 2.2 some time in late  
2007 or early 2008.

Comments, questions?

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRo/BJnEjvBPtnXfVAQL7iwP/TfPPvMsTnrrSxQAlvPjQoR27ySqUYh+P
yZCvGxxp9DgNoFQOWl0mo1QzZ9ozXtiFfIHx4CJLybOis+yuiq+BWtih2MJnGBf7
SzD8qsBOu6N4sE8sn4n0tdmXr1fnh4qnrgTobvBX+3toJtHNGQTEVEZCxiWb5fKq
JsUKDVVvOhQ=
=CVNK
-----END PGP SIGNATURE-----

From barry at python.org  Sat Jul  7 22:19:50 2007
From: barry at python.org (Barry Warsaw)
Date: Sat, 7 Jul 2007 16:19:50 -0400
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
Message-ID: <C279DDB8-C9B2-447C-B05E-4B2301313993@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 7, 2007, at 12:35 PM, Barry Warsaw wrote:

> I intend to create an official Mailman 2.2 branch in bzr where  
> these efforts can land as they mature.

This branch is now live.

http://wiki.list.org/display/DEV/MailmanBranches

Cheers,
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRo/1Z3EjvBPtnXfVAQIwNAP+MgJj026MHEGwdXMoma9uNkTp8UzeLtCC
mKh7OkcmZPMzKrNdlztQ5OmU1N1SWf9medErjM7QcKeIR0y+9aUjC65j8mamwWPa
+XrbzdlWZoDxnO5qFh02rVNFATKRH00+ITiB6LvTEKJVxp9r+WL1sKq0FEElu9/W
zkl80deXVvQ=
=V7ce
-----END PGP SIGNATURE-----

From pabs at debian.org  Sun Jul  8 07:06:02 2007
From: pabs at debian.org (Paul Wise)
Date: Sun, 8 Jul 2007 15:06:02 +1000
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
Message-ID: <e13a36b30707072206h37a05ea5mc9169856f2189acf@mail.gmail.com>

On 7/3/07, Terri Oda <terri at zone12.com> wrote:

> I'm trying to remember all the things people have suggested for the
> archives in the past so I can figure out what needs to be done and
> what might be nice to have, and see if this is doable in the time I
> have in the foreseeable future.

At lists.indymedia.org, we use a patch that provides these:

* stable URLs based on a generated message id
* URLs to the archived message in the message headers
* message hiding

http://lists.indymedia.org/patches/imc-10-mmid_hide_posts.patch

It poses a bit of a migration issue since all the existing mboxes may
or may not have the mmid header in them. We worked around that by
having a special place for the old archives.

We've been meaning to move to lurker for years, but haven't had the
human resources and also there were some showstoppers:

* public/private lists - lurker couldn't do that properly when we looked
* lack of date-based index to the archives
* general navigation issues; stuff like linking between current thread
and nearby ones
* mailto links (has now been fixed)
* the migration nightmare

My personal opinion is that pipermail should be removed and mailman
should not contain a default archiver since there are plenty of good
archivers already (lurker, mhonarc etc). Adding wrappers around them
would be simpler than reimplementing them.

-- 
bye,
pabs

https://docs.indymedia.org/view/Main/PaulWise

From iane at sussex.ac.uk  Mon Jul  9 15:35:33 2007
From: iane at sussex.ac.uk (Ian Eiloart)
Date: Mon, 09 Jul 2007 14:35:33 +0100
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
Message-ID: <DCAEC70D78293D9A88CEF915@lewes.staff.uscs.susx.ac.uk>


--On 7 July 2007 12:35:30 -0400 Barry Warsaw <barry at python.org> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Now that we've successfully navigated the switch to Bazaar, it's time
> to lay out plans for future Mailman releases. ....

> I plan on updating the wiki pages to reflect this thinking, but I
> would like to get feedback from y'all about the plan.  It would be
> awesome if we could see a release of Mailman 2.2 some time in late
> 2007 or early 2008.
>
> Comments, question?

All sounds very good. My two main problems with driving Mailman uptake here 
on campus are both to do with usability. The first is the current web 
interface, which was great when it was developed, but expectations have 
moved on. The second is the lack of a unified user database. So, it's great 
to see these items listed as the main focus for 2.2 and 3.0 respectively.

WRT 2.2, I'd like to be able to offer something as simple to use as the 
list management features of Google Groups (which I use for some voluntary 
groups that I work with), but with the ability to expose additional 
functionality on request.

WRT 3.0, for enterprise and education purposes, it's important to be able 
to hook into existing authentication and authorisation mechanisms. For us, 
that means LDAP - at least for authentication. On the other hand, we also 
have external people using our lists, so we need to be able to either put 
them into an SQL database which will work in conjunction with LDAP, or to 
add a separate LDAP tree for them, or something similar.

Something that I've mentioned before, is the importance of preventing 
collateral spam. So, I'd like to be able to have my MTA ask Mailman whether 
a particular email address is permitted to post to a particular list, at 
SMTP time. I'm using Exim, which could call an external python script, but 
I'd rather be able to issue an SMTP callout to a running daemon, for 
efficiency. The callout would be executed after each "RCPT TO".
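
The per-recipient check described above could look roughly like the
following. This is only a minimal sketch: the ALLOWED_POSTERS table, the
rcpt_check name, and the example addresses are all hypothetical, and a
real daemon would ask Mailman for the list's membership and moderation
settings rather than consult a hard-coded dict.

```python
# Hypothetical data: posting address -> addresses allowed to post.
# A real callout daemon would query Mailman instead of this dict.
ALLOWED_POSTERS = {
    "staff@lists.example.ac.uk": {"alice@example.ac.uk", "bob@example.ac.uk"},
}


def rcpt_check(sender, rcpt, allowed=ALLOWED_POSTERS):
    """Return an SMTP-style (code, text) reply for a single RCPT TO."""
    posters = allowed.get(rcpt.lower())
    if posters is None:
        return 550, "no such list"
    if sender.lower() in posters:
        return 250, "OK"
    return 550, "sender not permitted to post to this list"
```

Rejecting at RCPT TO time means the MTA never accepts the message, so
no bounce is generated later towards a possibly forged sender.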


-- 
Ian Eiloart
IT Services, University of Sussex
x3148

From thijs at debian.org  Mon Jul  9 16:06:13 2007
From: thijs at debian.org (Thijs Kinkhorst)
Date: Mon, 9 Jul 2007 16:06:13 +0200
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
Message-ID: <200707091606.15858.thijs@debian.org>

On Saturday 7 July 2007 18:35, Barry Warsaw wrote:
> Mailman 2.1 is (shockingly) four and a half years old, having been
> initially released on 30-Dec-2002.  The last release in the series,
> 2.1.9 was made almost a year ago.  In the meantime, Mark and Tokio
> have been doing a great job maintaining the 2.1 branch, with several
> important patches in the tree now that will eventually become
> 2.1.10.  The problem of course is that we can't add any new features
> to the 2.1 family <wink>, so we should be thinking about a new major
> release.

These sound like sensible plans and I'm curious about what 2.2 and 3.0 will 
bring. However, my question is whether we can expect some 2.1.x releases in 
the short term (like 2.1.10 you mentioned). As you say it will take quite 
some while for 2.2 to be released, and we'd like to get the fixed bugs in the 
2.1.x branch to our users in the meantime.

Regular 2.1.x releases with assorted fixes would be welcome to not scare users 
away from Mailman while we're waiting for the "big" releases.

thanks,
Thijs

From stephen at xemacs.org  Tue Jul 10 05:09:39 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 10 Jul 2007 12:09:39 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87k5tgxg0j.fsf@athene.jamux.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<87k5tgxg0j.fsf@athene.jamux.com>
Message-ID: <87y7hpt0ng.fsf@uwakimon.sk.tsukuba.ac.jp>

John A. Martin writes:

 > In the absence of a Message-ID
 > on an outgoing mail message many if not most MTAs will add one.  Why
 > not let Mailman anticipate the need to add a Message-ID when archiving
 > the message rather than leaving it to the outgoing MTA?

Quite.

My reason for saying "last resort" is simply that this is not
predictable to third parties.  Eg, I send you (a non-subscriber) a
message with CC and no Message-ID.  You'd like to find the thread in
the archives.  You may as well just do a linear search on that month's
threads.

An URL based on an MD5 of the message body in theory would work, but
in the presence of non-ASCII bodies, structured MIME, ML digests, and
various MTA autoconversions, that seems fragile.
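
To illustrate the fragility, here is a small sketch showing the same
text hashing differently once a hop re-encodes it. The sample body and
the body_md5 helper are made up for the example; quoted-printable
stands in for whatever conversion an MTA might apply.

```python
import hashlib
import quopri

# The same human-readable body, as raw 8-bit UTF-8 and as quoted-printable.
body_8bit = "Caf\u00e9 au lait\n".encode("utf-8")
body_qp = quopri.encodestring(body_8bit)


def body_md5(body):
    """Hex MD5 digest of the body bytes exactly as they appear on the wire."""
    return hashlib.md5(body).hexdigest()


# The two digests differ although a reader sees identical text.
print(body_md5(body_8bit))
print(body_md5(body_qp))
```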


From schwabe at upb.de  Tue Jul 10 08:04:20 2007
From: schwabe at upb.de (Arne Schwabe)
Date: Tue, 10 Jul 2007 08:04:20 +0200
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <DCAEC70D78293D9A88CEF915@lewes.staff.uscs.susx.ac.uk>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
	<DCAEC70D78293D9A88CEF915@lewes.staff.uscs.susx.ac.uk>
Message-ID: <46932164.3000007@upb.de>


>> I plan on updating the wiki pages to reflect this thinking, but I
>> would like to get feedback from y'all about the plan.  It would be
>> awesome if we could see a release of Mailman 2.2 some time in late
>> 2007 or early 2008.
>>
>> Comments, question?
>>     
>
> All sounds very good. My two main problems with driving Mailman uptake here 
> on campus are both to do with usability. The first is the current web 
> interface, which was great when it was developed, but expectations have 
> moved on. The second is the lack of a unified user database. So, it's great 
> to see these items listed as the mail focus for 2.2 and 3.0 respectively.
>
> WRT 2.2, I'd like to be able to offer something as simple to use as the 
> list management features of Google Groups (which I use for some voluntary 
> groups that I work with), but with the ability to expose additional 
> functionality on request.
>
>   
At our University we developed a customized mini Interface called
'simple' Interface. The normal mailman Interface is still there, called
'expert admin'. A (non working) demo is here:
https://lists.uni-paderborn.de/listadm/demo.html The code does not use
the mailman template system nor does it have multi language abilities.
It even includes code specific to our installation. (We have a
membership class that maps users to users in LDAP and can create dynamic
lists with users from LDAP + static users.)

But maybe something like this should be included in a future Mailman
release: either a static simple interface or even a customizable
simple interface that is sufficient for 95% of the people (with well
chosen defaults for your university/organisation).


> WRT 3.0, for enterprise and education purposes, it's important to be able 
> to hook into existing authentication and authorisation mechanisms. For us, 
> that means LDAP - at least for authentication. On the other hand, we also 
> have external people using our lists, so we need to be able to either put 
> them into an SQL database which will work in conjunction with LDAP, or to 
> add a separate LDAP tree for them, or something similar.
>   
This is possible with Mailman 2.1 using a self-written Mailman
membership class, at least for list members. If someone really needs
this I could look into polishing the code and making it public.
> Something that I've mentioned before, is the importance of preventing 
> collateral spam. So, I'd like to be able to have my MTA ask Mailman whether 
> a particular email address is permitted to post to a particular list, at 
> SMTP time. I'm using Exim, which could call an external python script, but 
> I'd rather be able to issue an SMTP callout to a running daemon, for 
> efficiency. The callout would be executed after each "RCPT TO".
>
>
>   
The same for email that gets rejected for spam reasons would be neat.

Arne


From barry at python.org  Fri Jul 20 14:02:34 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:02:34 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 4, 2007, at 3:49 AM, Stephen J. Turnbull wrote:

>> Along with that, I would really like to come up with an algorithm for
>> calculating those urls without talking to the archiver.
>
> Brad didn't like this when I suggested it before, but I didn't really
> understand why not.  Anyway, FWIW:
>
> I suggest adding an X-List-Received-ID header to all messages.  I
> haven't really thought through whether the UUID in that field should
> be at least partly human-readable or not, but that doesn't matter for
> the basic idea.[1]  The on-disk directory format would be
>
> /path-to-archive/private/my-list/Message-ID
>
> for singletons (Message-ID is the author-supplied ID) and
>
> /path-to-archive/private/my-list/Message-ID/List-Received-ID
>
> for multiples.  These would be created on-the-fly when they occur.
> They can be served as static pages.  For almost all messages, the bare
> URL
>
> http://archives.example.com/my-list/Message-ID
>
> should Just Work (ie, return a no-such-object result or a single
> message).  Where it does not, you get an index of all pages with that
> message ID.

I think this suggestion has merit, but I'm going to riff on it a bit.

First, I want to avoid talking about file system layout.  To me,  
that's an implementation detail we needn't worry about right now.   
Maybe the files will live on disk, maybe they'll live in a database,  
maybe they'll live in an external system we don't control.  I don't  
care.  What I want is a uniform way to calculate an address for a  
message given nothing but its text and an interface for retrieving  
messages from a service given that address.  I'm thinking about this  
in a RESTful way, and it's perfectly legitimate for that 'message  
address' to be relative to some archive or message store root.

I've done some experiments.  I took the top 5 mbox files on  
python.org and ran them through a script that looked for message-id  
collisions.  Then I implemented 6 strategies for looking at whether  
the collisions were true collisions or duplicates.    Duplicates are  
defined where every message in the same message-id bucket has the  
same match criteria, and collisions are where at least one message in  
the bucket is different.  So for example, with strategy 2, if the  
message-id and date headers are the same for every message in the  
bucket, it's a dupe, otherwise it's a collision.
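
The bucket test described above can be sketched as follows. This is not
the script that produced the numbers below; classify_buckets and by_date
are hypothetical names, and counting whole buckets (rather than
individual messages) is an assumption.

```python
from collections import defaultdict
from email import message_from_string


def classify_buckets(messages, match_key):
    """Return (duplicate_buckets, collision_buckets) for repeated Message-IDs."""
    buckets = defaultdict(list)
    for msg in messages:
        mid = msg["Message-ID"]
        if mid is not None:      # messages with no Message-ID are ignored here
            buckets[mid].append(msg)

    dups = cols = 0
    for bucket in buckets.values():
        if len(bucket) < 2:
            continue             # a unique Message-ID is neither case
        if len({match_key(m) for m in bucket}) == 1:
            dups += 1            # every message agrees on the match criterion
        else:
            cols += 1            # at least one message differs
    return dups, cols


def by_date(msg):
    """Strategy 2: messages match when their Date headers are identical."""
    return msg["Date"]


first = message_from_string(
    "Message-ID: <a@example.com>\nDate: Mon, 2 Jul 2007 10:00:00 -0400\n\nhi\n")
again = message_from_string(
    "Message-ID: <a@example.com>\nDate: Mon, 2 Jul 2007 10:00:00 -0400\n\nhi\n")
print(classify_buckets([first, again], by_date))
```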

While I ran the script over each mbox separately, I think it's more  
interesting to talk about them as a whole collection.  I don't really  
know how representative this would be of the world at large, but it's  
interesting anyway.  FTR, the lists were mailman-users, python-dev,  
python-help, python-list, and tutor.  I think there would be little  
intentional cross-posting between these lists.  Here are the numbers:

total 325146, missing: 624
1. msg.as_string(), dup: 34 (0.0104568409268%), col: 914 (0.281104488445%)
2. message-id + date, dup: 875 (0.269109876794%), col: 73 (0.0224514525782%)
3. message-id + 1st received, dup: 270 (0.0830396191249%), col: 678 (0.208521710247%)
4. message-id + all received, dup: 270 (0.0830396191249%), col: 678 (0.208521710247%)
5. message-id + date + 1st received, dup: 268 (0.0824245108351%), col: 680 (0.209136818537%)
6. body_line_iterator(msg), dup: 659 (0.202678181494%), col: 289 (0.0888831478782%)

Notice that of 325146 total messages, 624 of them had no message-id  
header.  Even if you aggregate dup+col, you're still looking at a  
total duplicate rate of 0.29%.  While I'm almost tempted to ignore a  
hit rate that low, if you think of an archive holding 1B messages,  
you still get a lot of duplicates.
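
The arithmetic behind that estimate, as a small sketch using the figures
reported above (dup + col comes to 948 for every strategy); the variable
names are only illustrative:

```python
TOTAL = 325146        # messages scanned across the five lists
REPEATED = 875 + 73   # dup + col for strategy 2; the sum is the same for all six

rate = REPEATED / TOTAL
projected = round(rate * 1_000_000_000)

print(f"{rate:.2%} of messages share a Message-ID with another message")
print(f"about {projected:,} such messages in a 1B-message archive")
```

So even at a 0.29% rate, a billion-message archive would hold roughly
2.9 million messages whose Message-ID is not unique.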

OTOH, the rate goes down even lower if you consider the message-id  
and date headers.  (Note, I did not consider messages missing a date  
header).  How likely is it that two messages with the same message-id  
and date are /not/ duplicates?  Heck, at that point, I'd feel  
justified in simply automatically rejecting the duplicate and  
chucking it from the archive.

I spent a /little/ time looking at the physical messages that ended  
up as true collisions.  Though by no means did I look at them all,  
they all looked related.  For example, with strategy 2 some messages  
look like they'd been inadvertently sent before they were completed.   
I need to see if there are any similarities in the MUAs behind these, but  
again, I think we might be able to safely assume that collisions on  
message-id+date can be ignored.

That leads me to the following proposal, which is just an elaboration  
on Stephen's. First, all messages live in the same namespace; they  
are not divided by target mailing list.  Each message has two  
addresses, one is the Message-ID and one is the base32 of the sha1  
hash of the Message-ID + Date.  As Stephen proposes, Mailman would  
add these headers if an incoming message is missing them, and tough  
luck for the non-list copy.  The nice thing is that RFC 2822 requires  
the Date header and states that Message-ID SHOULD be present.

Why the second address?  First, it provides as close to a guaranteed  
unique identifier as we can expect, and second because it produces a  
nearly human readable format.  For example, Stephen's OP would have a  
second address of

 >>> import base64, hashlib
 >>> mid
'<87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp>'
 >>> date
'Wed, 04 Jul 2007 16:49:58 +0900'
 >>> # XXX perhaps strip off angle brackets
 >>> h = hashlib.sha1(mid)
 >>> h.update(date)
 >>> base64.b32encode(h.digest())
'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

I like base32 instead of base64 because the more limited alphabet  
should produce less ambiguous strings in certain fonts and I don't  
think the short b64 strings are short enough to justify the  
punctuation characters that would result.  While RFC 3548 specifies  
the b32 alphabet as using uppercase characters, I think any service  
that accepts b32 ids should be case insensitive.  A really Postel-y  
service could even accept '1' for 'I' and '0' for 'O' just to make it  
more resilient to human communication errors.
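
A minimal sketch of both halves, written for Python 3 (where the
headers must be encoded to bytes before hashing). The function names are
hypothetical, and whether to strip the angle brackets first is left
open, so the strings are hashed exactly as given.

```python
import base64
import hashlib


def global_id(message_id, date):
    """Base32 of the SHA1 of Message-ID followed by Date, both taken verbatim."""
    h = hashlib.sha1(message_id.encode("utf-8"))
    h.update(date.encode("utf-8"))
    return base64.b32encode(h.digest()).decode("ascii")


def normalize_global_id(text):
    """Accept lower case, and '0' for 'O' and '1' for 'I', as suggested above."""
    candidate = text.strip().upper().translate(str.maketrans("01", "OI"))
    # Raises binascii.Error (a ValueError subclass) if this is not valid base32.
    base64.b32decode(candidate)
    return candidate
```

A SHA1 digest is 20 bytes, so the base32 form is always 32 characters
with no '=' padding. The standard library's base64.b32decode also takes
casefold and map01 arguments that cover the same leniency when decoding.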

I'd like to come up with a good name for this second address, which  
would suggest the name of the X- header we stash this value in.  
X-B32-Message-ID isn't very sexy.  Maybe X-Message-Global-ID, since I  
think there's a reasonable argument to make that for well-behaved  
messages, that's exactly what this is.

So now, think of the interface to a message store that supports this  
addressing scheme.  Well it's something like:

class MessageStore(Interface):
    def store_message(message):
        """Store the message.

        :raises ValueError: when the message is missing either the
        Message-ID header or a Date header.
        :raises DuplicateMessageError: when a message in the store
        already has a matching Message-ID and Date.  An archive is free
        to raise this exception for duplicate Message-IDs alone.
        """

    def get_message_by_global_id(key):
        """Locate and return the message from the store that matches `key`.

        :param key: The Global ID of the message to locate.  This is the
        base32 encoded SHA1 hash of the message's Message-ID and Date
        headers.
        :returns: The message object matching the Global ID, or None if
        there is no such match.
        """

    def get_messages_by_message_id(key):
        """Return the set of messages with a matching Message-ID `key`.

        :param key: The Message-ID of the messages to locate.
        :returns: The set of all messages in this store that have the
        given Message-ID.  If none such matches are found, the empty set
        is returned.
        """

As far as generating pages based on the Message-ID or global id, I  
agree with Stephen's proposal.  A page returned in response to a  
message-id request could return the message page or it could return  
an index of such messages.  It would be up to the archive whether it  
would accept duplicate Message-IDs or not, but it would always be  
guaranteed that a page returned in response to a global id request  
would return one email message.
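
As a rough illustration of that guarantee, here is a toy in-memory
store following the interface sketched earlier. It is an assumption-laden
sketch, not a proposed implementation: the class name is made up, a list
stands in for the "set" of messages, and store_message returning the
global id is a convenience the interface does not specify.

```python
import base64
import hashlib
from email import message_from_string


class DuplicateMessageError(Exception):
    """A message with the same Message-ID and Date is already stored."""


class InMemoryMessageStore:
    def __init__(self):
        self._by_global_id = {}    # global id -> message
        self._by_message_id = {}   # Message-ID -> list of messages

    def store_message(self, message):
        mid, date = message["Message-ID"], message["Date"]
        if mid is None or date is None:
            raise ValueError("message needs both Message-ID and Date")
        digest = hashlib.sha1((mid + date).encode("utf-8")).digest()
        key = base64.b32encode(digest).decode("ascii")
        if key in self._by_global_id:
            raise DuplicateMessageError(key)
        self._by_global_id[key] = message
        self._by_message_id.setdefault(mid, []).append(message)
        return key

    def get_message_by_global_id(self, key):
        # At most one message can ever live under a global id.
        return self._by_global_id.get(key)

    def get_messages_by_message_id(self, key):
        # May hold several messages if the archive accepts repeated ids.
        return list(self._by_message_id.get(key, []))
```

A lookup by Message-ID can return zero, one, or several messages, which
is why a page for it may be an index; a lookup by global id returns one
message or nothing.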

URLs could be calculated by concatenating the List-Archive and  
X-Message-Global-ID headers, e.g.

http://mail.python.org/pipermail/mailman-developers/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be the OP.  This could point to the same resource as

http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

and /might/ point to the same resource as:

http://mail.python.org/pipermail/mailman-developers/87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
http://mail.python.org/pipermail/mids/87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
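
The concatenation itself is trivial; a sketch, assuming the List-Archive
value may arrive wrapped in angle brackets as RFC 2369 writes it
(archive_url is a hypothetical helper name):

```python
def archive_url(list_archive, global_id):
    """Join a List-Archive header value and a global id into one URL."""
    base = list_archive.strip().strip("<>")
    return base.rstrip("/") + "/" + global_id


print(archive_url(
    "<http://mail.python.org/pipermail/mailman-developers/>",
    "RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI",
))
```

Nothing here needs to contact the archiver, which is the point of the
scheme: anyone holding the message headers can work out the link.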

> A minor drawback to my proposal is that if a message gets archived as
> a singleton for that Message-ID, then a duplicate arrives, previously
> created references in the archive will of course now return an index
> rather than the desired message.  Ie, there is data corruption.  This
> can be dealt with in several ways; the easiest would be to provide a
> "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking- 
> for-me"
> link when creating the directory for multiple instances.

Or by using the global id, or by rejecting messages with duplicate  
message ids.

> There's also a *very* minor benefit: repeat sends will be immediately
> recognizable without checking Message-ID.
>
> Footnotes:
> [1]  By partly human-readable I mean containing list-id and date
> information.  The idea would be to have the date come first, so that
> users would have a shot at identifying which of several messages is
> most likely, and this would be searchable by eye with simply an
> ordinary sorted index.

I see searching, indexing, sorting, and providing other human  
readable urls into the message store as a function of the archive.   
Once you're looking at a link to the actual message, you're going to  
be looking at a url that contains the global id, regardless of the  
number of levels you have to go through or redirects involved.

Apologies for letting this thread linger so long.  I'm very  
interested in hearing your thoughts and if there's general  
agreement, I'll write it up in the wiki.

- -Barry


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCkWnEjvBPtnXfVAQIRhAP7BkuF5K0xOuie2GBqOOWDarksD5Oy49y9
/2WO+u4xH+BttIt3adHJS+K6ETYcK79c5Rf4uwZk40DqWKK7ay1zkxUn/LGXOJ0o
CoWQG5ZyFUJUTkDXtxEWcZ8kkXaDTTSNz2eCtYgQAXw77A95E1SjV0YBs54bFK3A
Bi9cjrKRDcM=
=pyY6
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 14:19:57 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:19:57 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <468BD60A.2020709@Newfield.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>	<468A34B7.3080201@astro.princeton.edu>
	<13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
	<468BD60A.2020709@Newfield.org>
Message-ID: <0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote:

> Barry Warsaw wrote:
>> Maybe a way to think about this is that the canonical url is based on
>> the message-id, but then there's some way to distill even this down
>> to a tinyurl or simple integer that would be stable in the face of
>> full archive regenerations.
>
> The resistance to basing this on message-id has always been that  
> there's
> no guarantee of uniqueness...
> ...but I believe each list has some sort of counter for how many
> messages it's seen, so we could add another header with that  
> number, and
> use as a unique id the two concatenated together...
> (That way the archiver can know from the content of the header exactly
> how to generate the same unique id as mailman, which would allow  
> for the
> url-in-the-footer to happen w/o first hitting the archiver.)

I'm not crazy about this idea for a couple of reasons.  First, it  
means that someone who has a copy of the message that didn't come  
from the list (e.g. one of the two you will get of this message),  
cannot calculate this unique ID.  Second, things can happen to a list  
that might cause this sequence number to get corrupted.  Maybe a list  
will get deleted and then recreated.  Maybe it will get moved and the  
sequence number will get reset in the move.  Maybe the list will be  
upgraded to a new version of Mailman.

I think we can do just as well by using Message-ID + Date and get  
very low collision rates.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCobXEjvBPtnXfVAQIHFQP/Sz6WVqyFmo0lraw0hyyP5x4AhgBPDQmA
/rFfSBRGbdORLXA2Ss0YdhI5cy8n7LMSsLawgtSt+JA7F5IEiC6Hk5C1M8C+Oe09
4ICYEuuL+gcXPPVc4aYtxp33HvPBFCzPJkGBS2PHaqCQkYIKdWHCtDZ8iLWCOxjc
b674lsQk9tM=
=a09C
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 14:27:54 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:27:54 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707041230m47110705t89cdbe3d2e4802cd@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<468A34B7.3080201@astro.princeton.edu>
	<13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
	<e03b90ae0707041230m47110705t89cdbe3d2e4802cd@mail.gmail.com>
Message-ID: <7B257080-6504-4804-84A1-1EC2F32EB5CB@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote:

>> Maybe a way to think about this is that the canonical url is based on
>> the message-id, but then there's some way to distill even this down
>> to a tinyurl or simple integer that would be stable in the face of
>> full archive regenerations.
>
> I'd suggest the reverse. Keep the canoncical archive URL short and
> sweet, and then use a URL redirection service to map message-id's
> to those URLs. It is the archiver's job to make it all work. For  
> example,
> the canonical  archive URL might stay exactly the way it is in  
> pipermail.
> But the archival link embedded in the message would instead go
> to a redirection service.

I agree.  My proposed global message id is exactly the canonical  
archive URL, although it's relative to the archiver's base url, as  
given in the List-Archive header.

> http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
> http://mail.codeit.com/msgid?002701c4eb3d$07170ca0$3142003e at ADSL
>
> The one other thing I'd ike to revisit is integration with third party
> archival services. There are two obvious integration points; one is a
> button in the Mailman list admin user interface that says "archive  
> with
> service X" not unlike the setting in Firefox that basically says  
> "search
> with service X".

I think we could define an interface that archive services would have  
to meet in order to be available to list admins.  The site admin  
would of course have to enable them site-wide first.  Why kinds of  
information would be required?

- - List-Archive base url
- - Message injection procedure
- - Additional subscription procedures

The nice thing is that if my global id idea works, the injection  
process can be completely asynchronous.
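
One way such an interface might be written down, purely as a sketch: the
Archiver and QueuedArchiver names, the method names, and the in-process
queue are all hypothetical, chosen only to mirror the three items listed
above.

```python
from abc import ABC, abstractmethod
from collections import deque


class Archiver(ABC):
    """Hypothetical contract an archive service would have to meet."""

    @property
    @abstractmethod
    def list_archive_base(self):
        """Base URL to advertise in the List-Archive header."""

    @abstractmethod
    def inject(self, message_text):
        """Hand one message over to the archive."""

    def subscribe(self, list_name):
        """Optional extra subscription step; nothing to do by default."""
        return None


class QueuedArchiver(Archiver):
    """Queues messages for later delivery instead of sending them inline."""

    def __init__(self, base_url):
        self._base_url = base_url
        self.pending = deque()

    @property
    def list_archive_base(self):
        return self._base_url

    def inject(self, message_text):
        # The message URL depends only on its headers, so delivery can wait.
        self.pending.append(message_text)
```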

> The other integration point is the archival link
> discussed above. In which case it would be set to something like.
>
> http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

All we'd need to know is the third party's List-Archive header  
value.  The last part of the path would always be the global message id.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCqSnEjvBPtnXfVAQJq7gQArkmEb3DqrOaRTdYnQ0SCOrqWtiPxNJOd
555+JiHt/mEqPTuS/cF1GfdckwrQXbUJYWeO56dXzfbXtCVaW54h4k/95RI2/mqK
HR2BKcoVW/dDfYUd2V2Vbqdc7trVIy3oGdzQb24Pu9bIptqbdVSpnmx8jm9GIOi1
UAkJp+Ff5nc=
=lE32
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 14:39:34 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:39:34 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <1183651769.10813.6.camel@finch.boston.redhat.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
Message-ID: <C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 5, 2007, at 12:09 PM, John Dennis wrote:

> A little over a year ago I went on a search to find the best open  
> source
> archiver and at that time I came up with Lurker
> (http://lurker.sourceforge.net) Since then I believe Lurker has seen a
> major new revision. I also believe Lurker is the archiver used by
> Debian.
>
> So if you want to leverage existing open source archiving or at least
> look at an example of what would be necessary to allow easy easy
> external archiving integration with Mailman you might want to look at
> Lurker.

I've looked at a few lurker archivers and I wasn't blown away by its  
user interface.  That's apparently highly configurable though.

Lurker's GPL2 so that's fine.  I'd be quite hesitant about shipping  
Mailman with Lurker because it's something we don't control and it's  
not Python.  But I would be totally open to working with the Lurker  
developers on creating an easy bridge between the two systems.   
Perhaps this dovetails with Jeff's suggestion of easier integration  
with external archiving systems.

Does anybody have contacts with the Lurker community that could  
cross-post a new thread to get the discussion going?

(The same goes for any other archiver out there too.)

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCtBnEjvBPtnXfVAQLgJwP9HNu/r/5YYAGn0HcQAhD8b8plDSpm2tao
VcC7tROs0EyjRAQd1b3+hF102FMZzTXF/8LifgETN8K4MD9TXkxNhrTlKjmAUhLG
1tvHZT9oD73aLb81m2SuI3nbp8kQSMncPeMM4u1vGzpXfCYGK4chAPyIJ1Z5MNqj
6byAgVpwZEo=
=qjmf
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 14:43:11 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:43:11 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e13a36b30707072206h37a05ea5mc9169856f2189acf@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<e13a36b30707072206h37a05ea5mc9169856f2189acf@mail.gmail.com>
Message-ID: <94995768-85AC-4F96-8980-B26686B27426@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 8, 2007, at 1:06 AM, Paul Wise wrote:

> My personal opinion is that pipermail should be removed and mailman
> should not contain a default archiver since there are plenty of good
> archivers already (lurker, mhonarc etc). Adding wrappers around them
> would be simpler than reimplementing them.

My hesitation to this has always been the turnkey question.   
Pipermail has its problems but it /does/ allow small sites to get  
going very quickly with a full(-ish) solution.

It may be that most people get their Mailman installation from their  
distro or hosting service and this is no longer as important.  In  
that case, I still wouldn't chuck Pipermail, but I would try to see  
if we can adopt Jeff's goal of making the archive selection pluggable  
and easily selectable by list admins.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCt4HEjvBPtnXfVAQJHQwP+P4KAQaA7uEeISQjFyb3zoMvOWwgoW3zH
taWsnVAhVmAF/hJBWDn7JtXwWiLw7ngCtGHp3MBKGBKzBjJP7ZizEMNfziaB+OoO
LOyF7sYB+KhKVi+Il7XnHYIjh6DSD8kullP+G/UNtuIsFnNs+aTntndfMKJG2Zct
E7M0F1Ok8FE=
=xXQJ
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 14:45:14 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:45:14 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87y7hpt0ng.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<87k5tgxg0j.fsf@athene.jamux.com>
	<87y7hpt0ng.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <7194735A-3720-43FB-A1AF-35EEB7DAC271@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote:

> John A. Martin writes:
>
>> In the absence of a Message-ID
>> on an outgoing mail message many if not most MTAs will add one.  Why
>> not let Mailman anticipate the need to add a Message-ID when  
>> archiving
>> the message rather than leaving it to the outgoing MTA?
>
> Quite.
>
> My reason for saying "last resort" is simply that this is not
> predictable to third parties.  Eg, I send you (a non-subscriber) a
> message with CC and no Message-ID.  You'd like to find the thread in
> the archives.  You may as well just do a linear search on that month's
> threads.

Yep, and I say "tough".  Let John complain to Stephen to fix his MTA  
to add those Message-IDs so Mailman doesn't have to. ;)

> An URL based on an MD5 of the message body in theory would work, but
> in the presence of non-ASCII bodies, structured MIME, ML digests, and
> various MTA autoconversions, that seems fragile.

Agreed, and it would do no better, in fact worse, than
base32(sha1(message-id + date))
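
For illustration, a minimal Python sketch of that calculation.  The
function name, the plain concatenation of the two header values, and
the UTF-8 encoding are assumptions; nothing in this thread settles how
the headers should be normalized first.

```python
import base64
import hashlib


def global_id(message_id, date):
    """Return base32(sha1(message-id + date)) as an ASCII string."""
    # Assumption: the raw header values are concatenated unchanged.
    data = (message_id + date).encode("utf-8")
    digest = hashlib.sha1(data).digest()  # 20 bytes
    # 20 bytes encode to exactly 32 base32 characters, with no padding.
    return base64.b32encode(digest).decode("ascii")
```

A SHA-1 digest is 160 bits, so the result is always 32 characters
drawn from A-Z and 2-7, which is short enough to use in a URL.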

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCuW3EjvBPtnXfVAQKx/AP9EUxDQmp1tiCEqJqVSFWeicq/9lThnMZN
58UUEPA47wPa1SJSk6z7+0vSfqTskwO1Frnn8OJ6X+MJAxCX4Hr86uBOnK9XW2AK
byCfeYHBdapGlrsxmPd0so+FFJODWWRu7+yyKTw6ApDwVevatEEIMPlZkMALMv5S
axC5ttHfR2E=
=c0pw
-----END PGP SIGNATURE-----

From stephen at xemacs.org  Fri Jul 20 15:21:27 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 20 Jul 2007 22:21:27 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
Message-ID: <87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > First, I want to avoid talking about file system layout.  To me,  
 > that's an implementation detail we needn't worry about right now.   

Agreed.

 > How likely is it that two messages with the same message-id and
 > date are /not/ duplicates?

For message id generators that include a time-stamp in the generated
id, approximately the same as the probability that two messages with
the same message-id are not duplicates, no?

 > Heck, at that point, I'd feel justified in simply automatically
 > rejecting the duplicate and chucking it from the archive.

I'd rather not go there.  There may be applications for the archiver
that require that all mail received be filed.

Counterproposal: have a "collisions" namespace, and provide an
interface for the list owner to decide what to do with them.  They
could be thrown away, they could be given an alternative global ID
somehow and added (eg, the archive page could add a "See probable
duplicates too" link), or they could be put into a moderation-like
queue for list admins to decide about.

 > So now, think of the interface to a message store that supports this  
 > addressing scheme.  Well it's something like:

I don't understand how the calling application is supposed to deal
with a DuplicateMessageError exception since it should not change
either the Message-ID or the Date if present.

I see this as a major problem with any proposal to use only author
headers in computing the "global id".

 > Or by using the global id, or by rejecting messages with duplicate  
 > message ids.

Er, the MTA has already accepted it.  Do you plan to generate a list
manager bounce to the poster?  This has the unpleasant misfeature that
it could be used to bounce spam off the list manager, since the poster
needs to see content to determine whether this is a multiple send or
actually the "intended version" after a "fat-finger" send; we already
know the message-id isn't good enough.


From stephen at xemacs.org  Fri Jul 20 15:31:19 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Fri, 20 Jul 2007 22:31:19 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<468A34B7.3080201@astro.princeton.edu>
	<13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
	<468BD60A.2020709@Newfield.org>
	<0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org>
Message-ID: <87sl7jnqvs.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > Second, things can happen to a list  
 > that might cause this sequence number to get corrupted.

Add an X-Mailman-Sequence-Number header if not already present.

That doesn't deal with your other comments, but as I point out
elsewhere, if you don't use *any* Mailman-specific information in the
global ID, you have no sane way to handle collisions except throw them
away (or make the global ID refer to a collection resource, but that's
kinda unintuitive).

From barry at python.org  Fri Jul 20 15:49:49 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 09:49:49 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <2EA1C28C-3C27-428C-9A4C-F09039B13A29@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote:

>> How likely is it that two messages with the same message-id and
>> date are /not/ duplicates?
>
> For message id generators that include a time-stamp in the generated
> id, approximately the same as the probability that two messages with
> the same message-id are not duplicates, no?

Good point, though clearly not all message-ids have timestamp  
information in them.  It does help explain why I see 600-odd more  
collisions when taking other data into account too.  I've modified my  
script to sort collisions and dupes into maildir folders, so I'll  
take a closer look when that finishes running (it takes a long time  
to slog through all 5 mboxes, even on a fairly zippy dual-G5).

>> Heck, at that point, I'd feel justified in simply automatically
>> rejecting the duplicate and chucking it from the archive.
>
> I'd rather not go there.  There may be applications for the archiver
> that require that all mail received be filed.

True.  It would ultimately be an archiver policy though.

> Counterproposal: have a "collisions" namespace, and provide an
> interface for the list owner to decide what to do with them.  They
> could be thrown away, they could be given an alternative global ID
> somehow and added (eg, the archive page could add a "See probable
> duplicates too" link), or they could be put into a moderation-like
> queue for list admins to decide about.

I like this.

>> So now, think of the interface to a message store that supports this
>> addressing scheme.  Well it's something like:
>
> I don't understand how the calling application is supposed to deal
> with a DuplicateMessageError exception since it should not change
> either the Message-ID or the Date if present.
>
> I see this as a major problem with any proposal to use only author
> headers in computing the "global id".

Mailman would probably log and ignore DuplicateMessageErrors.  It  
wouldn't be Mailman's responsibility to ensure the message gets  
archived, although I concede that as currently defined, you could end  
up with list copies that had a global id header that wasn't unique.   
OTOH, if the archiver implements a collision resolution policy such  
as a 'collisions' namespace, it wouldn't ever raise  
DuplicateMessageError.
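
As a rough sketch of the two behaviours described above, assuming a
hypothetical in-memory MessageStore (only the DuplicateMessageError
name comes from this thread; the class, its keep_collisions flag and
the archive() helper are made up for illustration):

```python
import logging

log = logging.getLogger("archiver")


class DuplicateMessageError(Exception):
    """Raised when a global id is already present in the store."""


class MessageStore:
    """Hypothetical store keyed by global id."""

    def __init__(self, keep_collisions=False):
        self.messages = {}     # global id -> first message seen
        self.collisions = {}   # global id -> later arrivals
        self.keep_collisions = keep_collisions

    def add(self, gid, message):
        if gid not in self.messages:
            self.messages[gid] = message
        elif self.keep_collisions:
            # Policy: file the message in a 'collisions' namespace.
            self.collisions.setdefault(gid, []).append(message)
        else:
            raise DuplicateMessageError(gid)


def archive(store, gid, message):
    """Caller side: log and ignore duplicates, as suggested above."""
    try:
        store.add(gid, message)
    except DuplicateMessageError:
        log.warning("duplicate global id %s; message not archived", gid)
        return False
    return True
```

With keep_collisions=True the store never raises, so the caller's
except branch is never taken; what happens to the filed collisions
afterwards is left to the list owner's policy.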

>> Or by using the global id, or by rejecting messages with duplicate
>> message ids.
>
> Er, the MTA has already accepted it.  Do you plan to generate a list
> manager bounce to the poster?  This has the unpleasant misfeature that
> it could be used to bounce spam off the list manager, since the poster
> needs to see content to determine whether this is a multiple send or
> actually the "intended version" after a "fat-finger" send; we already
> know the message-id isn't good enough.

Yes, this wouldn't be an MTA bounce, it would be a Mailman bounce.   
But it would have to be subject to the same bounce rules as any other  
auto-response which could be used as a spam vector, e.g. limit the  
number of bounces per time period and don't include the entire  
original message in the bounce (as both can be, and are used as spam  
vectors).
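
A minimal sketch of the "limit the number of bounces per time period"
rule, assuming a per-address sliding window.  The class name, the
default limits and the injectable clock are illustrative assumptions,
not Mailman's actual auto-response code.

```python
import time
from collections import defaultdict, deque


class BounceLimiter:
    """Allow at most max_bounces per address within period seconds."""

    def __init__(self, max_bounces=3, period=3600.0, clock=time.time):
        self.max_bounces = max_bounces
        self.period = period
        self.clock = clock
        self._sent = defaultdict(deque)  # address -> send timestamps

    def allow(self, address):
        now = self.clock()
        sent = self._sent[address]
        # Forget bounces that have fallen out of the window.
        while sent and now - sent[0] >= self.period:
            sent.popleft()
        if len(sent) >= self.max_bounces:
            return False
        sent.append(now)
        return True
```

The caller would check allow(poster) before generating a bounce and
silently drop the response when it returns False.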

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqC9fnEjvBPtnXfVAQLkEQQAhdu0BIvpRvTk92m9J/sbHVRSRxBGMqta
Cm57WyRJGBxPV3xTE4ghVzXdDyIEvUjKimRTEWbeX60WqROL6FPsmAnwmsYbW3mw
8hqNXj+SpHP+1GIYnYgY9txiM75fHDa5T0VsjpcXAwtjeepHouXAEWbegBUrIzHt
EBp5YCMqxv8=
=5tjc
-----END PGP SIGNATURE-----

From nigel.metheringham at dev.intechnology.co.uk  Fri Jul 20 15:17:26 2007
From: nigel.metheringham at dev.intechnology.co.uk (Nigel Metheringham)
Date: Fri, 20 Jul 2007 14:17:26 +0100
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
Message-ID: <E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>


On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
> I've looked at a few lurker archives and I wasn't blown away by its
> user interface.  That's apparently highly configurable though.

I'd be inclined to agree wrt user interface. Documentation regarding
this, and anything else to do with lurker, appears somewhat scarce -
speaking as someone who has just migrated the exim.org lists to using
lurker archiving. [previously we used mailman with the MHonArc/pipermail
hybrid]

I am considering starting a set of pages within our wiki about use of
lurker (we tend to cover almost everything else about mail so why not
that).

> Lurker's GPL2 so that's fine.  I'd be quite hesitant about shipping
> Mailman with Lurker because it's something we don't control and
> it's not Python.  But I would be totally open to working with the
> Lurker developers on creating an easy bridge between the two systems.
> Perhaps this dovetails with Jeff's suggestion of easier integration
> with external archiving systems.

Integration with externals feels like a good way to go.

> Does anybody have contacts with the Lurker community that could cross-
> post a new thread to get the discussion going?

The ML appears... lacking in vigor..

BTW lurker gives all messages an ID which is 3 parts separated by
periods. The first part is a date field - ie 20070720, the second part
is the receive time, UTC, as 6 digits, and the final part is some form
of hex id. The nice part is if you quote just the first (or first 2)
parts of message ID you get messages around that time...
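
To make the ID layout concrete, a small Python sketch that selects IDs
by a leading date or date.time prefix.  The sample IDs are invented,
and the exact-match filter is only an approximation of the idea: this
is not lurker's own lookup code, which returns messages *around* the
given time.

```python
def matches_prefix(message_id, prefix):
    """True if the dotted id starts with the dotted prefix parts."""
    id_parts = message_id.split(".")
    prefix_parts = prefix.split(".")
    return id_parts[:len(prefix_parts)] == prefix_parts


def select(ids, prefix):
    """Return the ids whose leading parts equal the prefix, sorted."""
    return sorted(i for i in ids if matches_prefix(i, prefix))


# Invented examples in the date.time.hex layout described above.
ids = [
    "20070720.131726.0a1b2c3d",
    "20070720.143856.4e5f6a7b",
    "20070721.090000.8c9d0e1f",
]
```

Here select(ids, "20070720") picks the two messages from that day, and
select(ids, "20070720.131726") narrows it to the first one.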

	Nigel.

--
[ Nigel Metheringham           Nigel.Metheringham at InTechnology.co.uk ]
[ - Comments in this message are my own and not ITO opinion/policy - ]


From barry at python.org  Fri Jul 20 16:07:48 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 10:07:48 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87sl7jnqvs.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<468A34B7.3080201@astro.princeton.edu>
	<13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
	<468BD60A.2020709@Newfield.org>
	<0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org>
	<87sl7jnqvs.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <1BC6A9D0-144B-4EE1-90C7-EEBF00396B22@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote:

> Barry Warsaw writes:
>
>> Second, things can happen to a list
>> that might cause this sequence number to get corrupted.
>
> Add an X-Mailman-Sequence-Number header if not already present.
>
> That doesn't deal with your other comments, but as I point out
> elsewhere, if you don't use *any* Mailman-specific information in the
> global ID, you have no sane way to handle collisions except throw them
> away (or make the global ID refer to a collection resource, but that's
> kinda unintuitive).

I'd probably call it X-List-Sequence-Number and I'd have to ensure  
that the archive copy had that header in it.  OTOH, if I'm going to go to
the trouble of adding this sequence number, why not just calculate a  
(more likely) gid for the message myself?  If I did that, I could use  
a tinyurl scheme and get much shorter urls.  The archiver would then  
be obliged to use my X-List-GID header verbatim.

I've been pushing for calculating this using non-Mailman headers  
because I'd /like/ for a client receiving the non-list copy to be  
able to make the same calculation.  OTOH, maybe we can have it both  
ways.

So, we calculate the sequence number and generate the following headers:

X-List-Sequence-Number: 801
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

The latter is composed of purely author generated data, the former is  
supplied by Mailman.

Assuming we also had this header:

List-Archive: http://archive.example.com/gid/

then the following urls would both point to the exact same resource:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801

If however we subsequently got a collision, then these two urls would  
address different resources.  E.g.:

X-List-Sequence-Number: 2112
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

Now the two messages would still be addressable by their respective  
urls:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/2112

but

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be a disambiguation page.  For a web u/i it would be an HTML  
list containing relative links to '801' and '2112'.  A RESTful XML  
document would contain the set of links to the subordinate pages.  A  
client of the archive.example.com service would have to be prepared  
to handle disambiguation pages if it used only the author generated  
GID, but it would be guaranteed that the full url would lead directly  
to one and only one email message.

Archivers would have to recognize the X-List-Sequence-Number and honor
it whenever they regenerated their archives so that the urls would
remain stable.
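
The addressing scheme above could be sketched like this in Python.
GidIndex and its methods are hypothetical names, and the in-memory
dict stands in for whatever the archiver really uses; only the URL
layout (gid, optionally followed by the sequence number) comes from
the description above.

```python
class GidIndex:
    """Hypothetical index: gid -> {sequence number: message}."""

    def __init__(self):
        self._by_gid = {}

    def add(self, gid, seqno, message):
        self._by_gid.setdefault(gid, {})[seqno] = message

    def resolve(self, path):
        """Resolve 'GID' or 'GID/SEQNO' relative to the base url."""
        parts = path.strip("/").split("/")
        bucket = self._by_gid[parts[0]]  # KeyError if gid is unknown
        if len(parts) == 2:
            # Full url: always exactly one message.
            return ("message", bucket[int(parts[1])])
        if len(bucket) == 1:
            # No collision: the bare gid is unambiguous.
            return ("message", next(iter(bucket.values())))
        # Collision: hand back the relative links to choose from.
        return ("disambiguation", sorted(bucket))
```

A client that only knows the author-generated GID has to handle the
"disambiguation" result; a client that also has the sequence number
never sees it.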

Thinking about this more (and I've been up since about 3:30am so I'm  
a little foggy right now ;), we may want to optimize for fewer dupes  
rather than fewer collisions, or maybe it doesn't matter.  It would  
be interesting to see how big the message-id buckets are when only  
using the Message-ID header.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDBtHEjvBPtnXfVAQLOggQAhIjxlU2jPDb5K8Lfe3NThjgwKiPblqtm
UurUj+AZCffS1ewGDlV6y3GGRnHEzdVSIVvAiATEGTRVG8Zzbbev3GXs0EKYiEyL
FZreNcPqDAPL0KSGw73RdAiwZuszfQcMTsSwOx98zS9Kz0NtbntYQTuqQZwo7wAW
3KeGe2PkpaI=
=yhaZ
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 16:26:59 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 10:26:59 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>
Message-ID: <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote:

>
> On 20 Jul 2007, at 13:39, Barry Warsaw wrote:
>> I've looked at a few lurker archives and I wasn't blown away by its
>> user interface.  That's apparently highly configurable though.
>
> I'd be inclined to agree wrt user interface. Documentation regarding
> this, and anything else to do with lurker, appears somewhat scarce -
> speaking as someone who has just migrated the exim.org lists to using
> lurker archiving. [previously we used mailman with the MHonArc/ 
> pipermail
> hybrid]

I noticed that!  There's no documentation link on the site.  I also  
saw your question regarding getting a message out of lurker given its  
message-id.  When I checked yesterday I didn't see a response.

> I am considering starting a set of pages within our wiki about use of
> lurker (we tend to cover almost everything else about mail so why not
> that).

That would be cool.  Feel free to add a link to your pages on the  
Mailman wiki, perhaps here:

http://wiki.list.org/display/DOC/Home

>> Does anybody have contacts with the Lurker community that could  
>> cross-
>> post a new thread to get the discussion going?
>
> The ML appears... lacking in vigor..
>
> BTW lurker gives all messages an ID which is 3 parts separated by
> periods. The first part is a date field - ie 20070720, the second part
> is the receive time, UTC, as 6 digits, and the final part is some form
> of hex id. The nice part is if you quote just the first (or first 2)
> parts of message ID you get messages around that time...

Obviously Mailman can't know the second and third parts so it can't  
use them in its list copies.  I dislike using YYYYMMDD because of the
high number of collisions.

I should make clear that what I'm really proposing is not specific to  
Mailman or any particular archiver.  It's really an interface to a  
generic message store.  We succeed by convincing other mailing list  
software and archivers to adopt the same standard so that they can  
interoperate seamlessly.  We can perhaps have the first  
implementations of this de facto standard (any latent RFC shepherds
out there? :).  We get everyone else to adopt it when we take over  
the world.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDGNHEjvBPtnXfVAQIwVQQAlwcmmuoXz/vKlpdu27wCHnfpwhhrQMmn
DWMEayuJsG+qg3GvkwyHGkgTBalENdDWWAQpPE9Zf9nmY24FyqhqRpe/QhOCajBV
4+lvXR1FARur4y4E9Lzcjz1TzX3lkaxx3dVCqpOtJxNVVvv442eYsLf11E3Z+wxY
m+ootMkR5pE=
=y4za
-----END PGP SIGNATURE-----

From nigel.metheringham at dev.intechnology.co.uk  Fri Jul 20 16:38:56 2007
From: nigel.metheringham at dev.intechnology.co.uk (Nigel Metheringham)
Date: Fri, 20 Jul 2007 15:38:56 +0100
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>
	<1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
Message-ID: <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>


On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
>> BTW lurker gives all messages an ID which is 3 parts separated by
>> periods. The first part is a date field - ie 20070720, the second
>> part is the receive time, UTC, as 6 digits, and the final part
>> is some form of hex id. The nice part is if you quote just the
>> first (or first 2) parts of message ID you get messages around that
>> time...
>
> Obviously Mailman can't know the second and third parts so it can't
> use them in its list copies.  I dislike using YYYYMMDD because of the
> high number of collisions.

It's used as part of a UID, but has the nice feature of allowing easy
queries as to other messages at that time.

If the archiver is local you also have the information for part 2 of the
UID - lurker takes it from the From_ line.

Nigel.
--
[ Nigel Metheringham           Nigel.Metheringham at InTechnology.co.uk ]
[ - Comments in this message are my own and not ITO opinion/policy - ]


From barry at python.org  Fri Jul 20 16:52:17 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 10:52:17 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>
	<1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
	<0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
Message-ID: <1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Nigel,

On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote:

> On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
>>> BTW lurker gives all messages an ID which is 3 parts separated by
>>> periods. The first part is a date field - ie 20070720, the second
>>> part is the receive time, UTC, as 6 digits, and the final part
>>> is some form of hex id. The nice part is if you quote just the
>>> first (or first 2) parts of message ID you get messages around that
>>> time...
>>
>> Obviously Mailman can't know the second and third parts so it can't
>> use them in its list copies.  I dislike using YYYYMMDD because of the
>> high number of collisions.
>
> It's used as part of a UID, but has the nice feature of allowing easy
> queries as to other messages at that time.

That should definitely be a way to traverse to the message, but it's  
not the message's global id (a.k.a. canonical address relative to the  
base url of the message store).  An archiver could provide other ways  
to traverse to the message, such as:

/barry at python.org/ to see all messages by me
/barry at python.org/mailman-developers/20070720 to see all messages by  
me today to this mailing list
/Subject?Improving%20the%20archives&sort=thread to find all the  
messages in this thread regardless of when they were posted

etc.

> If the archiver is local you also have the information for part 2  
> of the
> UID - lurker takes it from the From_ line.

Mailman gets the From_ line before passing off to the archiver.  But  
that's interesting, does lurker /require/ the From_ line?

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDMInEjvBPtnXfVAQKJFAP/Y3FsBIXrSaRZ85eCl+pVTZxez2uRn0KB
2OMBV6vS/qC8K1R/myeGpBVr44yE/AfTa+kf+MLSlIlMpJdUlWDMWw2G90IPy1gv
t1VGrwbVPmOlLFxF8kIsi6NKIZpKoJrJVdQnSc+uPCqowIDU9FQ57+2hrH8HayTS
ISAZ0FTgAzk=
=sp+m
-----END PGP SIGNATURE-----

From nigel.metheringham at dev.intechnology.co.uk  Fri Jul 20 16:59:03 2007
From: nigel.metheringham at dev.intechnology.co.uk (Nigel Metheringham)
Date: Fri, 20 Jul 2007 15:59:03 +0100
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>
	<1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
	<0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
	<1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>
Message-ID: <E6B6EE3B-1F4B-4F56-A111-18722AD3D830@dev.intechnology.co.uk>


On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
> Mailman gets the From_ line before passing off to the archiver.   
> But that's interesting, does lurker /require/ the From_ line?
>

Well lurker handles Maildir - no From_ but the same info is in the  
filename, and it can take messages on stdin without a From_ - at  
which point I guess it's either faking it (from the headers) or making
things up.

	Nigel.

--
[ Nigel Metheringham           Nigel.Metheringham at InTechnology.co.uk ]
[ - Comments in this message are my own and not ITO opinion/policy - ]


From barry at python.org  Fri Jul 20 17:16:19 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 11:16:19 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <E6B6EE3B-1F4B-4F56-A111-18722AD3D830@dev.intechnology.co.uk>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>
	<1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
	<0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
	<1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>
	<E6B6EE3B-1F4B-4F56-A111-18722AD3D830@dev.intechnology.co.uk>
Message-ID: <FC1E8521-F20A-4B6F-B0DA-C457D1494432@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote:

> On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
>> Mailman gets the From_ line before passing off to the archiver.
>> But that's interesting, does lurker /require/ the From_ line?
>>
>
> Well lurker handles Maildir - no From_ but the same info is in the
> filename, and it can take messages on stdin without a From_ - at
> which point I guess it's either faking it (from the headers) or making
> things up.

Cool.  I wonder if lurker is compatible with Python 2.5's  
mailbox.Maildir implementation and whether the two could share the  
maildirs.  Thanks for the information!
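
For reference, a minimal sketch of reading an existing maildir from
the Python side with the standard library's mailbox.Maildir (written
for current Python 3 rather than 2.5).  The list_subjects helper is a
made-up name, and whether lurker can safely share the same directory
is exactly the open question; this only shows the Python half.

```python
import mailbox


def list_subjects(maildir_path):
    """Return the sorted Subject headers of all messages in a maildir."""
    # create=False: fail instead of silently creating a new maildir.
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        return sorted(msg["subject"] or "" for msg in box)
    finally:
        box.close()
```

If the path does not exist, mailbox.Maildir raises
mailbox.NoSuchMailboxError rather than creating it.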

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDRw3EjvBPtnXfVAQJHXwP/SiKhWiZ57thW84RBUWt9QVjf4KISEfRJ
H5lioRVPYYegiJp7rf/08TutkNsxGCHzRd/cdMEFXMkrCAdifLQ2QIdS4LRvEKyY
eRbVHcmxyAlwMbyUq36W+pcH2MutTM64HKNrbL9YRSTaLyMA11FnmaiGIK3RMnbM
AqtLGRSJ8Ec=
=D8oM
-----END PGP SIGNATURE-----

From stephen at xemacs.org  Fri Jul 20 19:21:54 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Sat, 21 Jul 2007 02:21:54 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <2EA1C28C-3C27-428C-9A4C-F09039B13A29@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp>
	<2EA1C28C-3C27-428C-9A4C-F09039B13A29@python.org>
Message-ID: <87odi7ng7h.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > But it would have to be subject to the same bounce rules as any other  
 > auto-response which could be used as a spam vector, e.g. limit the  
 > number of bounces per time period and don't include the entire  
 > original message in the bounce

But that prevents detecting a prematurely sent message, which is
presumably a common use case for genuine collisions.

I just don't think bouncing back is going to be very useful; either
you don't give the user the information he needs to figure out what
happened, or you give the spammers a vector.


From Dale at Newfield.org  Fri Jul 20 01:10:25 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Thu, 19 Jul 2007 19:10:25 -0400
Subject: [Mailman-Developers] Potential solution to protecting email
	addresses in archive...
Message-ID: <469FEF61.7050204@Newfield.org>

http://mailhide.recaptcha.net/

-Dale

D<a 
href="http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&amp;c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=" 
onclick="window.open('http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&amp;c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=','','toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); 
return false;" title="Reveal this e-mail address">...</a>@Newfield.org

From barry at python.org  Fri Jul 20 22:52:41 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 16:52:41 -0400
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <DCAEC70D78293D9A88CEF915@lewes.staff.uscs.susx.ac.uk>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
	<DCAEC70D78293D9A88CEF915@lewes.staff.uscs.susx.ac.uk>
Message-ID: <0C108CCF-5DF9-4E1F-A88D-39C91757FBA7@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

It's catch up on email day!

On Jul 9, 2007, at 9:35 AM, Ian Eiloart wrote:

> WRT 3.0, for enterprise and education purposes, it's important to  
> be able
> to hook into existing authentication and authorisation mechanisms.  
> For us,
> that means LDAP - at least for authentication. On the other hand,  
> we also
> have external people using our lists, so we need to be able to  
> either put
> them into an SQL database which will work in conjunction with LDAP,  
> or to
> add a separate LDAP tree for them, or something similar.

So, my standard answer to this will be, Mailman will provide  
interfaces for these things and ensure that the application is  
written to only ask these questions through the interface.  It will  
be easy to plug in different backends, so if someone were to write an  
LDAP or hybrid backend, Mailman would work with it.  This will be  
much easier than the current hacks required for Mailman 2.1.  The  
goal would be to support such plugins through Python eggs, so that  
such extensions are easy to install and enable.

> Something that I've mentioned before, is the importance of preventing
> collateral spam. So, I'd like to be able to have my MTA ask Mailman  
> whether
> a particular email address is permitted to post to a particular  
> list, at
> SMTP time. I'm using Exim, which could call an external python  
> script, but
> I'd rather be able to issue an SMTP callout to a running daemon, for
> efficiency. The callout would be executed after each "RCPT TO".

Yes, you'll have this capability.  Really, you could do it now with a  
bit of Python coding.  I'm currently jonesing for a RESTful interface  
to Mailman 3 for controlling it, asking it questions, etc., which of  
course the site-administrator would have to enable.  The only other  
viable alternatives I see are XMLRPC, or the current 'write some  
custom Python script' approach.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqEgmXEjvBPtnXfVAQIp7gP/Xh2SnvuaQScrZdZx2YvCQfPA4IwpdjMN
3yEEw3BOVtvcVM2/mTXel9ZFMx0I9sEvi5U+Fk0E5Bk8/KQ/Nr9Y7SxWxx3mF1UE
ssmLVeNt1k5OziufLwATQcsqAV47YNdj1vcJnhlPuq5k+LMgNDA0XLG2dqFgJ/z5
p1dSCJ96HZ4=
=7TmC
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 22:56:17 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 16:56:17 -0400
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <200707091606.15858.thijs@debian.org>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
	<200707091606.15858.thijs@debian.org>
Message-ID: <D199B684-2182-4902-888D-42FE6E08209E@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 9, 2007, at 10:06 AM, Thijs Kinkhorst wrote:

> These sound like sensible plans and I'm curious about what 2.2 and  
> 3.0 will
> bring. However, my question is whether we can expect some 2.1.x  
> releases in
> the short term (like 2.1.10 you mentioned). As you say it will take  
> quite
> some while for 2.2 to be released, and we'd like to get the fixed  
> bugs in the
> 2.1.x branch to our users in the meantime.
>
> Regular 2.1.x releases with assorted fixes would be welcome to not  
> scare users
> away from Mailman while we're waiting for the "big" releases.

Agreed.  Mark and Tokio will decide when 2.1.10 is ready, and I  
suspect that we will have perhaps a few more point releases before  
2.2 and 3.0 are out.

I would like the upgrade from 2.1.x to 2.2 to be as easy as 2.1.x to  
2.1.x+1.  The upgrade to 3.0 will likely require running an 'export'  
command in 2.x followed by an 'import' command in 3.0.  The really  
tricky parts will be resolving conflicts when merging users.  I  
haven't thought this far ahead, but it may be that there's an  
intervening conflict resolution step, or resolution strategies in the  
'import' command.
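
A rough sketch of what a resolution strategy in the 'import' step might look like, in present-day Python 3.  Everything here is an assumption made for illustration: the record layout (`address`, `list`, `realname`), the `merge_users` and `prefer_longest` names, and the choice to treat addresses as case-insensitive are not part of any existing Mailman export format.

```python
def merge_users(exported, resolve):
    """Merge per-list member records into one record per address.

    `exported` is an iterable of dicts with 'address', 'list' and
    'realname' keys (a hypothetical export format).  `resolve` is a
    callable that picks one value when two lists disagree.
    """
    users = {}
    conflicts = []
    for record in exported:
        # Simplification: compare addresses case-insensitively.
        address = record['address'].lower()
        user = users.setdefault(
            address, {'realname': record['realname'], 'lists': []})
        user['lists'].append(record['list'])
        if user['realname'] != record['realname']:
            # Remember the disagreement, then let the strategy decide.
            conflicts.append(
                (address, user['realname'], record['realname']))
            user['realname'] = resolve(
                user['realname'], record['realname'])
    return users, conflicts


def prefer_longest(old, new):
    """One possible resolution strategy: keep the longer real name."""
    return new if len(new) > len(old) else old


exported = [
    {'address': 'Anne@example.com', 'list': 'dev',
     'realname': 'Anne'},
    {'address': 'anne@example.com', 'list': 'users',
     'realname': 'Anne Person'},
    {'address': 'bart@example.com', 'list': 'dev',
     'realname': 'Bart'},
]
users, conflicts = merge_users(exported, prefer_longest)
```

Returning the list of conflicts alongside the merged users leaves room for the "intervening conflict resolution step": the site admin could review them before anything is committed.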

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqEhcXEjvBPtnXfVAQIKTAP/dpA6VkHY5qkdV9dx6YRQMEYHXuDfcqly
vhSzJ1tJXDIoXkCYa4uMcDTbyFKM3M+ytWR3LbcckGpsApikKgIG8rJz+ik3qIZc
rlm8c4fuevuP9M+uw3S4Z9xK8mxpEBaVvn3rfVywSq9dm4C5zJO0meQMPRz8IPRj
T/J1z613xdI=
=ToXV
-----END PGP SIGNATURE-----

From barry at python.org  Fri Jul 20 22:59:48 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 16:59:48 -0400
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <4692C472.1020201@uni-paderborn.de>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
	<DCAEC70D78293D9A88CEF915@lewes.staff.uscs.susx.ac.uk>
	<4692C472.1020201@uni-paderborn.de>
Message-ID: <CCE9B097-5D2D-445E-A0CD-645CC3DFBE48@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 9, 2007, at 7:27 PM, Arne Schwabe wrote:

> At our University we developed a customized mini Interface called
> 'simple' Interface. The normal mailman Interface is still there,  
> called
> 'expert admin'. A (non working) demo is here:
> https://lists.uni-paderborn.de/listadm/demo.html The code does not use
> the mailman template system nor does it have multi language abilities.
> It even includes code specific to our installation. (We have a
> membership class that maps users to user in ldap and can create  
> dynamic
> list with users from ldap + static users)
>
> But maybe something like this should be included in future Mailman
> installations. Either a static simple interface or even a customizable
> simple interface that is sufficient for 95% of the people (with well
> chosen defaults for your university/organisation)

This is exactly the kind of thing that would be nice to have.  I'd  
also like for 3.0 to have a 'simple' and 'expert' list admin  
interface.  BTW, your page makes me wish either 1) I spoke German :)  
or 2) Google Translate could handle your URL!

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqEiRHEjvBPtnXfVAQIfLwQAr25h5wkRKWtvOAXOVfONnFlQEcyG1rYb
tTomQyD5FlSAZO6MeoPEW04QtwWjv9Q3tpxFlf4tRqaPmkdb6pd6WgLy7xkVlSn5
IlTSQdnehB2CNHOKHNMiXHrl45OxxDU9PDPPYyMhaZDNPzaEfT4ad8xay3ktn2cc
2zmW/oLziUo=
=HJOq
-----END PGP SIGNATURE-----

From barry at python.org  Sat Jul 21 22:25:07 2007
From: barry at python.org (Barry Warsaw)
Date: Sat, 21 Jul 2007 16:25:07 -0400
Subject: [Mailman-Developers] Major updates to Mailman 3.0 branch
Message-ID: <61705022-9B57-47A5-BC3C-BCE160EBD80A@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've just merged in my 'setuptools' branch to the official Mailman 3  
branch.  This means that all the autoconf-based building cruft is  
gone, dead, and buried.  Can you say "Yay"? :)

Instead, Mailman is now a setuptools based project, which is rapidly  
becoming the standard for Python applications and libraries.  There  
are lots and lots of benefits to this change; for developers, the  
most immediate benefit is that there's no more configure/make/make
install dance for every little change.  It means you can build a
'development' installation of Mailman and edit and test right in  
place.  This will hugely reduce the overhead for developing the code.

Also gone are the C wrapper programs, so Mailman is now a pure Python  
application.  With the change to wsgi-based web integration, and  
maildir delivery from the MTA, we really don't need them any more.

You'll notice a bunch of other things have disappeared too, like the  
3rd party packages we were distributing in the misc/ directory.   
Instead, these are downloaded on demand from the Python Cheeseshop  
<http://cheeseshop.python.org>.  All the packages Mailman depends on  
live in the Cheeseshop so we don't need to distribute them any more.   
And now that we're a setuptools-based project, when you build Mailman  
(see below), these dependent packages will be automatically  
downloaded and installed as necessary.  You'll need a net connection  
for the initial build, but after that, once the packages are  
installed, you're good to go.

To prepare your existing branch for the update, start by doing a  
'make distclean' followed by a 'bzr revert'.  IMPORTANT - don't do  
this if you have uncommitted changes!  'bzr stat' should now report  
no changes.

Next, cd into your 'misc' directory and remove the following  
directories:

- - Elixir-0.3.0
- - SQLAlchemy-0.3.3
- - setuptools-0.6c3
- - zope.interface-3.3.0.1

These are the unpacked dependent packages and you don't need them any  
more (in fact, they'll get in your way now).  Next, remove any  
residual .pyc files lying around, via:

% find . -name \*.pyc -print | xargs rm
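
If your tree has file names with spaces in them, the pipeline above can trip over them.  A portable alternative is sketched below, assuming a present-day Python 3 with pathlib (not the python2.5 used elsewhere in this message); the `remove_pyc` name is made up for the example.

```python
from pathlib import Path


def remove_pyc(root):
    """Delete every *.pyc file below `root`; return the paths removed."""
    removed = []
    # Collect the matches first so we never delete while scanning.
    for path in sorted(Path(root).rglob('*.pyc')):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `remove_pyc('.')` from the top of the working tree does the same job as the find command.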

Now do your 'bzr pull' to get the latest 3.0 branch changes.  If you  
have local modifications, you'll need to do a 'bzr merge' and resolve  
any conflicts.

To see if everything's cool, pick a 'development' directory.  I  
usually use a subdirectory called 'staging' in my 3.0 working tree.   
This development directory can be anything, but then do this:

% export PYTHONPATH=<my staging dir>

for me "<my staging dir>" would be `pwd`/staging.  Make sure this  
directory exists.

Next do this:

% python2.5 setup.py develop --install-dir <my staging dir>

After it churns for a bit, downloads some stuff, etc., look in your  
staging directory.  You'll have a bin directory there with all the  
Mailman command line scripts, but all the code will remain in your  
working tree.

You can now do this to run the test suite:

% <my staging dir>/bin/testall

You should see 72 tests passing and no failures.

Though you probably won't get very far, the way this is going to work  
in deployed installations is that you'll have to run
bin/make_instance to create some directories and do a few other things  
that the configure and Makefile.in's used to do.  After running that,  
you'll have an etc/mailman.cfg file that you can tweak for your  
installation.  Note that you don't need to do this for bin/testall  
because the testing infrastructure creates a temporary instance that  
it uses.

Let me know if you have any problems.
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqJro3EjvBPtnXfVAQL0OgQAktryl9zYZlZrm9dC644EL2hF+FMHs50v
2JTh2w6UhULAdpj+xneKR04pEfCw6HHNult03jo6NoNjYjMmzpUycDHXj7e0RaKE
mbhMkX2v2b6d01OsMrAbAyWBTZH+2rtmjEKANd/1/+LlIdy3KlJe09m+xgNY9VsV
keEIEdVQcYc=
=J3Mq
-----END PGP SIGNATURE-----

From amk at amk.ca  Sat Jul 21 23:16:11 2007
From: amk at amk.ca (A.M. Kuchling)
Date: Sat, 21 Jul 2007 17:16:11 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <FC1E8521-F20A-4B6F-B0DA-C457D1494432@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<E4476A01-8CF8-4EAF-BDB6-11C43C004185@dev.intechnology.co.uk>
	<1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
	<0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
	<1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>
	<E6B6EE3B-1F4B-4F56-A111-18722AD3D830@dev.intechnology.co.uk>
	<FC1E8521-F20A-4B6F-B0DA-C457D1494432@python.org>
Message-ID: <20070721211611.GA26947@andrew-kuchlings-computer.local>

On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote:
> Cool.  I wonder if lurker is compatible with Python 2.5's  
> mailbox.Maildir implementation and whether the two could share the  
> maildirs.  Thanks for the information!

It had better be -- Maildir has a published specification.  If there's
an incompatibility, that would be a bug in either mailbox.py or
lurker.
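
For reference, here is a minimal sketch of the Python side, written against the mailbox module as it exists in current Python 3 (the 2.5 module has the same Maildir class, but the message construction below uses the newer email API).  The `deliver` and `subjects` helpers and the example address are made up; nothing here has been tested against lurker.

```python
import mailbox
from email.message import EmailMessage


def deliver(maildir_path, subject, body):
    """Add one message to the maildir and return its key."""
    # create=True makes the tmp/, new/ and cur/ subdirectories
    # if the maildir does not exist yet.
    box = mailbox.Maildir(maildir_path, create=True)
    msg = EmailMessage()
    msg['From'] = 'anne@example.com'
    msg['Subject'] = subject
    msg.set_content(body)
    key = box.add(msg)
    box.close()
    return key


def subjects(maildir_path):
    """Return the subjects of all messages in the maildir, sorted."""
    box = mailbox.Maildir(maildir_path, create=False)
    return sorted(message['Subject'] for message in box)
```

Any other program that follows the Maildir specification should then see the delivered files under new/ in the same directory.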

--amk


From fil at rezo.net  Sun Jul 22 11:18:03 2007
From: fil at rezo.net (Fil)
Date: Sun, 22 Jul 2007 11:18:03 +0200
Subject: [Mailman-Developers] Potential solution to protecting email
	addresses in archive...
In-Reply-To: <469FEF61.7050204@Newfield.org>
References: <469FEF61.7050204@Newfield.org>
Message-ID: <bfc33ad70707220218t33d9d115xcf2fcab2841e01ed@mail.gmail.com>

> http://mailhide.recaptcha.net/

excellent

> D<a
> href="http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&amp;c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0="
> onclick="window.open('http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&amp;c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=','','toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300');
> return false;" title="Reveal this e-mail address">...</a>@Newfield.org

window.open(this.href) will do :-)

-- Fil

From barry at python.org  Sun Jul 22 15:26:59 2007
From: barry at python.org (Barry Warsaw)
Date: Sun, 22 Jul 2007 09:26:59 -0400
Subject: [Mailman-Developers] Potential solution to protecting email
	addresses in archive...
In-Reply-To: <bfc33ad70707220218t33d9d115xcf2fcab2841e01ed@mail.gmail.com>
References: <469FEF61.7050204@Newfield.org>
	<bfc33ad70707220218t33d9d115xcf2fcab2841e01ed@mail.gmail.com>
Message-ID: <B6980A08-57EE-44CD-9CF6-2B10D6831FA3@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 22, 2007, at 5:18 AM, Fil wrote:

>> http://mailhide.recaptcha.net/
>
> excellent

I like this one better:

http://www.thehumorarchives.com/joke/Best_Captcha_Ever

:)

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqNbJHEjvBPtnXfVAQLoUQQAjIAt/75iK0F0lViDLAFwaeb25H5INKCY
kIs/jt4shtEmXpXbW81JXHD4reaDcB8UOnv9cavtorPaOXaIaGTds/m4yUdqjlli
yKA9LLTEd0ys6LJhuwh774m2XpPLpi/V6i6owf8ojTtW/pm8C62G2/Zlvo8wq10p
8CsZxlkVaq8=
=l7LJ
-----END PGP SIGNATURE-----

From terri at zone12.com  Sun Jul 22 18:33:04 2007
From: terri at zone12.com (Terri Oda)
Date: Sun, 22 Jul 2007 12:33:04 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
Message-ID: <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com>

On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
> I've looked at a few lurker archivers and I wasn't blown away by its
> user interface.  That's apparently highly configurable though.

I've been doing a lot of thinking about interface, and I'm coming to  
the conclusion that something more like a web bulletin board is  
probably the way to go, given that people use them all the time  
without much trouble and with a fairly minimal amount of whining. ;)  
I'm trying to use interfaces to things like comment systems (which  
are often threaded -- picture the slashdot stuff, maybe?) and popular  
boards like phpbb (which isn't threaded beyond separate topics) as  
guides to how people usually deal with conversations on the web.

It'd actually be fairly easy, at that point, to just put a posting  
interface into the archives (yes, you'd have to be logged in, and  
yes, this means your password becomes that bit more valuable because  
someone having it can pose as you to the list... but they could do  
that by spoofing your email address so I'm not too concerned). But  
then people who don't like email or just want to pop by and check the  
list quickly could actually use mailman like a web board, which is  
something I'm pretty sure would get used (I know my users have asked  
for it in the past).

I've been drafting simple prototype interfaces in my head, trying to  
keep potential architectures in mind.  I'm hoping I'll have time this  
week to code up some HTML and see how well they actually work when  
they're not just inside my head. :)

  Terri


From Dale at Newfield.org  Sun Jul 22 22:40:04 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Sun, 22 Jul 2007 16:40:04 -0400
Subject: [Mailman-Developers] Potential solution to protecting email
 addresses in archive...
In-Reply-To: <bfc33ad70707220218t33d9d115xcf2fcab2841e01ed@mail.gmail.com>
References: <469FEF61.7050204@Newfield.org>
	<bfc33ad70707220218t33d9d115xcf2fcab2841e01ed@mail.gmail.com>
Message-ID: <46A3C0A4.4070302@Newfield.org>

Fil wrote:
>> D<a
>> href="http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&amp;c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=" 
>>
>> onclick="window.open('http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&amp;c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=','','toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); 
>>
>> return false;" title="Reveal this e-mail address">...</a>@Newfield.org
> 
> window.open(this.href) will do :-)


Yeah--that gunk is what they suggest as a replacement, but not what I 
ended up using.  Just the url is sufficient (albeit long).  (Since I'm 
depending upon outside resources to make this work, why not rely on 
*both* tinyurl.com *and* recaptcha.net ? :-)

-Dale Newfield
  http://tinyurl.com/2r49tj

From Dale at Newfield.org  Sun Jul 22 23:17:32 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Sun, 22 Jul 2007 17:17:32 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>	<849198AE-DEC3-44C8-A090-470720624185@python.org>	<1183651769.10813.6.camel@finch.boston.redhat.com>	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com>
Message-ID: <46A3C96C.3060106@Newfield.org>

Terri Oda wrote:
> I've been doing a lot of thinking about interface, and I'm coming to  
> the conclusion that something more like a web bulletin board is  
> probably the way to go

For public lists, the answer may lie in external tools like nabble.com 
or mailinglistarchive.com

Of course, that doesn't help for lists wishing to keep their content 
private.

-Dale

From jag at fsf.org  Mon Jul 23 15:32:44 2007
From: jag at fsf.org (Joshua 'jag' Ginsberg)
Date: Mon, 23 Jul 2007 09:32:44 -0400
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
Message-ID: <46A4ADFC.9020209@fsf.org>

I apologize for the late chiming in here.

I'd like to propose the XMLRPC extension for Mailman 2.2 that has been
developed over the last two years. I have in the last couple months
updated the patch to 2.1.9 and its code impact is really quite minimal;
it's really quite standalone. And for those looking for ways to interact
with the Mailman database from other applications, it provides the
necessary interfaces for administration and moderation functions.

Thoughts?

-jag

Barry Warsaw wrote:
> Now that we've successfully navigated the switch to Bazaar, it's time  
> to lay out plans for future Mailman releases.  I've talked to several  
> people about what to do about Mailman's future and I'd like to take  
> this opportunity to describe my thoughts and get your feedback.   
> First some background.
> 
> Mailman 2.1 is (shockingly) four and a half years old, having been  
> initially released on 30-Dec-2002.  The last release in the series,  
> 2.1.9 was made almost a year ago.  In the meantime, Mark and Tokio  
> have been doing a great job maintaining the 2.1 branch, with several  
> important patches in the tree now that will eventually become  
> 2.1.10.  The problem of course is that we can't add any new features  
> to the 2.1 family <wink>, so we should be thinking about a new major  
> release.
> 
> I've been making good progress on the SQLAlchemy/Elixir version, which  
> will finally get rid of pickles and put Mailman on a Real Database  
> (tm).  It's been clear to me for a while that this branch will have a  
> unified user database.  It simply makes no sense to build the  
> database back-end without once and for all fixing this design  
> constraint.  I've always said that the unified user database will be  
> in Mailman 3, and thus this branch is indeed called "Mailman 3.0".
> 
> I've been slowly building things back up from the ground floor.  The  
> basic data model is in pretty good shape and I'm taking a religious  
> test-driven approach to making things work again.  But the branch  
> still needs a lot of work, and I have no ETA for Mailman 3.0.
> 
> In the meantime, Andrew Kuchling and others have volunteered to work  
> on modernizing the Mailman web u/i, and Terri recently started a  
> thread discussing updates to the archiver.  I think it makes sense to  
> bless these efforts, towards the goal of releasing them in Mailman  
> 2.2.  I intend to create an official Mailman 2.2 branch in bzr where  
> these efforts can land as they mature.  My hope of course is that  
> we'll also be able to use much of this new code for Mailman 3.
> 
> I'd like to keep the changes for 2.2 focused on the web u/i and  
> archiver, with a small number of additional features to be  
> determined.  Mailman 2.2 should see no changes to the basic  
> architecture or 'database'; we'll continue to use pickles by default  
> for Mailman 2.2.  While I won't rule out other new features, I want  
> to be very picky about those that are accepted for 2.2, and would not  
> feel bad at all if we rejected or deferred until 3.0 most of those  
> proposed.  Criteria for other 2.2 features must include minimal code  
> impact with a high degree of reliability and stability.
> 
> I plan on updating the wiki pages to reflect this thinking, but I  
> would like to get feedback from y'all about the plan.  It would be  
> awesome if we could see a release of Mailman 2.2 some time in late  
> 2007 or early 2008.
> 
> Comments, questions?
> 
> -Barry
> 
_______________________________________________
Mailman-Developers mailing list
Mailman-Developers at python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives:
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe:
http://mail.python.org/mailman/options/mailman-developers/jag%40fsf.org

Security Policy:
http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq01.027.htp


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/mailman-developers/attachments/20070723/35001c17/attachment.pgp 

From jeff at jab.org  Tue Jul 24 08:02:46 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Mon, 23 Jul 2007 23:02:46 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
Message-ID: <e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>

> Notice that of 325146 total messages, 624 of them had no message-id
> header.  Even if you aggregate dup+col, you're still looking at a
> total duplicate rate of 0.29%.

Message IDs are supposed to be unique. This is discussed in
RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places.
If that's not the case, the mail transfer agent is broken. I think it's
better to go ahead and use the message-id, rather than concoct
yet another "this time we mean it!" unique identifier. This is a
cost/benefit thing; the cost is some real world collisions, the benefit
is a conceptually simpler system. Conceptually simpler things are
good especially when implemented all over the place.

Which brings me to suggestion #2, which is go ahead and write
an RFC on how list servers should embed archival links in messages.
This sounds like an internet wide interoperability issue as much as
something mailman specific. Why not come up with a scheme usable
by all list servers? And also describe a specification third party archival
services can comply to. Besides, I've always wanted to help write
an RFC. If we go that route, it would be good to get input from a range
of people - one person I'd suggest is Earl Hood, author of mhonarc.

Thoughts?

Jeff


> While I'm almost tempted to ignore a
> hit rate that low, if you think of an archive holding 1B messages,
> you still get a lot of duplicates.
>
> OTOH, the rate goes down even lower if you consider the message-id
> and date headers.  (Note, I did not consider messages missing a date
> header).  How likely is it that two messages with the same message-id
> and date are /not/ duplicates?  Heck, at that point, I'd feel
> justified in simply automatically rejecting the duplicate and
> chucking it from the archive.
>
> I spent a /little/ time looking at the physical messages that ended
> up as true collisions.  Though by no means did I look at them all,
> they all looked related.  For example, with strategy 2 some messages
> look like they'd been inadvertently sent before they were completed.
> I need to see if there's any similarities in MUA behind these, but
> again, I think we might be able to safely assume that collisions on
> message-id+date can be ignored.
>
> That leads me to the following proposal, which is just an elaboration
> on Stephen's. First, all messages live in the same namespace; they
> are not divided by target mailing list.  Each message has two
> addresses, one is the Message-ID and one is the base32 of the sha1
> hash of the Message-ID + Date.  As Stephen proposes, Mailman would
> add these headers if an incoming message is missing them, and tough
> luck for the non-list copy.  The nice thing is that RFC 2822 requires
> the Date header and states that Message-ID SHOULD be present.
>
> Why the second address?  First, it provides as close to a guaranteed
> unique identifier as we can expect, and second because it produces a
> nearly human readable format.  For example, Stephen's OP would have a
> second address of
>
>  >>> mid
> '<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>'
>  >>> date
> 'Wed, 04 Jul 2007 16:49:58 +0900'
>  >>> # XXX perhaps strip off angle brackets
>  >>> h = hashlib.sha1(mid)
>  >>> h.update(date)
>  >>> base64.b32encode(h.digest())
> 'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'
>
> I like base32 instead of base64 because the more limited alphabet
> should produce less ambiguous strings in certain fonts and I don't
> think the short b64 strings are short enough to justify the
> punctuation characters that would result.  While RFC 3548 specifies
> the b32 alphabet as using uppercase characters, I think any service
> that accepts b32 ids should be case insensitive.  A really Postel-y
> service could even accept '1' for 'I' and '0' for 'O' just to make it
> more resilient to human communication errors.
>
> I'd like to come up with a good name for this second address, which
> would suggest the name of the X- header we stash this value in.
> X-B32-Message-ID isn't very sexy.  Maybe X-Message-Global-ID, since I
> think there's a reasonable argument to make that for well-behaved
> messages, that's exactly what this is.
>
> So now, think of the interface to a message store that supports this
> addressing scheme.  Well it's something like:
>
> class MessageStore(Interface):
>     def store_message(message):
>         """Store the message.
>
>         :raises ValueError: when the message is missing either the
>             Message-ID header or a Date header.
>         :raises DuplicateMessageError: when a message in the store
>             already has a matching Message-ID and Date.  An archive
>             is free to raise this exception for duplicate
>             Message-IDs alone.
>         """
>
>     def get_message_by_global_id(key):
>         """Locate and return the message from the store that matches
>         `key`.
>
>         :param key: The Global ID of the message to locate.  This is
>             the base32 encoded SHA1 hash of the message's Message-ID
>             and Date headers.
>         :returns: The message object matching the Global ID, or None
>             if there is no such match.
>         """
>
>     def get_messages_by_message_id(key):
>         """Return the set of messages with a matching Message-ID `key`.
>
>         :param key: The Message-ID of the messages to locate.
>         :returns: The set of all messages in this store that have
>             the given Message-ID.  If no such matches are found, the
>             empty set is returned.
>         """
>
> As far as generating pages based on the Message-ID or global id, I
> agree with Stephen's proposal.  A page returned in response to a
> message-id request could return the message page or it could return
> an index of such messages.  It would be up to the archive whether it
> would accept duplicate Message-IDs or not, but it would always be
> guaranteed that a page returned in response to a global id request
> would return one email message.
>
> URLs could be calculated by concatenating the List-Archive and
> X-Global-Message-ID headers, e.g.
>
> http://mail.python.org/pipermail/mailman-developers/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
>
> would be the OP.  This could point to the same resource as
>
> http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
> http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
>
> and /might/ point to the same resource as:
>
> http://mail.python.org/pipermail/mailman-developers/87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp
> http://mail.python.org/pipermail/mids/87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp
>
> > A minor drawback to my proposal is that if a message gets archived as
> > a singleton for that Message-ID, then a duplicate arrives, previously
> > created references in the archive will of course now return an index
> > rather than the desired message.  Ie, there is data corruption.  This
> > can be dealt with in several ways; the easiest would be to provide a
> > "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-
> > for-me"
> > link when creating the directory for multiple instances.
>
> Or by using the global id, or by rejecting messages with duplicate
> message ids.
>
> > There's also a *very* minor benefit: repeat sends will be immediately
> > recognizable without checking Message-ID.
> >
> > Footnotes:
> > [1]  By partly human-readable I mean containing list-id and date
> > information.  The idea would be to have the date come first, so that
> > users would have a shot at identifying which of several messages is
> > most likely, and this would be searchable by eye with simply an
> > ordinary sorted index.
>
> I see searching, indexing, sorting, and providing other human
> readable urls into the message store as a function of the archive.
> Once you're looking at a link to the actual message, you're going to
> be looking at a url that contains the global id, regardless of the
> number of levels you have to go through or redirects involved.
>
> Apologies for letting this thread linger so long.  I'm very
> interested in hearing your thoughts and if there's general
> agreement, I'll write it up in the wiki.
>
> - -Barry
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (Darwin)
>
> iQCVAwUBRqCkWnEjvBPtnXfVAQIRhAP7BkuF5K0xOuie2GBqOOWDarksD5Oy49y9
> /2WO+u4xH+BttIt3adHJS+K6ETYcK79c5Rf4uwZk40DqWKK7ay1zkxUn/LGXOJ0o
> CoWQG5ZyFUJUTkDXtxEWcZ8kkXaDTTSNz2eCtYgQAXw77A95E1SjV0YBs54bFK3A
> Bi9cjrKRDcM=
> =pyY6
> -----END PGP SIGNATURE-----
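
To make the quoted MessageStore interface concrete, here is a minimal in-memory sketch in present-day Python 3.  It is an illustration under stated assumptions, not the proposed implementation: `MemoryMessageStore`, `global_id` and `normalize_global_id` are made-up names, headers are hashed as UTF-8 with the angle brackets left on, and lookups by Message-ID return a list rather than a set.  The forgiving lookup (lowercase accepted, '0' for 'O', '1' for 'I') leans on the `casefold` and `map01` arguments of `base64.b32decode()`.

```python
import base64
import hashlib


class DuplicateMessageError(Exception):
    """A message with the same Message-ID and Date is already stored."""


def global_id(message_id, date):
    """Base32 of the SHA-1 of Message-ID followed by Date."""
    h = hashlib.sha1(message_id.encode('utf-8'))
    h.update(date.encode('utf-8'))
    return base64.b32encode(h.digest()).decode('ascii')


def normalize_global_id(key):
    """Accept lowercase input, and '0'/'1' typed for 'O'/'I'."""
    # b32decode() does the forgiving part; re-encoding yields the
    # canonical uppercase spelling.
    raw = base64.b32decode(key, casefold=True, map01='I')
    return base64.b32encode(raw).decode('ascii')


class MemoryMessageStore:
    def __init__(self):
        self._by_global_id = {}
        self._by_message_id = {}

    def store_message(self, message):
        mid, date = message['Message-ID'], message['Date']
        if mid is None or date is None:
            raise ValueError('Message-ID and Date headers are required')
        key = global_id(mid, date)
        if key in self._by_global_id:
            raise DuplicateMessageError(key)
        self._by_global_id[key] = message
        self._by_message_id.setdefault(mid, []).append(message)
        return key

    def get_message_by_global_id(self, key):
        try:
            key = normalize_global_id(key)
        except ValueError:
            # Not valid base32 at all, so it cannot match anything.
            return None
        return self._by_global_id.get(key)

    def get_messages_by_message_id(self, key):
        return list(self._by_message_id.get(key, []))
```

A SHA-1 digest is 20 bytes, so the base32 form is always 32 characters with no '=' padding, which keeps the IDs URL-friendly.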

From stephen at xemacs.org  Tue Jul 24 08:56:35 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 24 Jul 2007 15:56:35 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
Message-ID: <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>

Jeff Breidenbach writes:

 > > Notice that of 325146 total messages, 624 of them had no message-id
 > > header.  Even if you aggregate dup+col, you're still looking at a
 > > total duplicate rate of 0.29%.
 > 
 > Message IDs are supposed to be unique.

Fortunately, a rule more honored in the observance than the breach.
Nonetheless, it *is* breached.  The Postel Principle applies here, IMO.

 > better to go ahead and use the message-id, rather than concoct
 > yet another "this time we mean it!" unique identifier.

That's not the point.  We're not going to impose this on senders;
that's what Message-ID is for, as you say.  If a sender won't provide
a proper Message-ID, third parties who get a CC are just out of luck.

I simply think we should be prepared for applications where relying on
the sender to supply a UUID is not acceptable; we need to be able to
provide one ourselves.  Creating UUIDs is a solved problem, after all.
So we just specify a header to put it in, and subscribers will be able
to use it, per definition of a canonical URL.

Then we say that an archive SHOULD provide access to the resource via
Message-ID if available, and define how to construct that URL from the
List-Archive and Message-ID headers.
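
One way that construction could go, sketched in present-day Python 3.  The rule used here (strip the angle brackets from both headers, percent-encode the Message-ID, join with a slash) is an assumption for illustration, and `archive_url` is a made-up name; no such rule has been agreed on.

```python
from urllib.parse import quote


def archive_url(list_archive, message_id):
    """Build an archive URL from List-Archive and Message-ID values."""
    # Both headers normally arrive wrapped in angle brackets.
    base = list_archive.strip().strip('<>').rstrip('/')
    mid = message_id.strip().strip('<>')
    # safe='' so that '/' and '@' inside the Message-ID are encoded too.
    return base + '/' + quote(mid, safe='')


url = archive_url(
    '<http://mail.python.org/pipermail/mailman-developers/>',
    '<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>')
# url == 'http://mail.python.org/pipermail/mailman-developers/'
#        '87myycy5eh.fsf%40uwakimon.sk.tsukuba.ac.jp'
```

Encoding everything outside the unreserved set means a Message-ID containing a slash cannot be mistaken for an extra path segment.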

 > Which brings me to suggestion #2, which is go ahead and write
 > an RFC on how list servers should embed archival links in messages.

I think Barry already suggested that?  Anyway, +1.  But remember, a
standards-track RFC should have a working implementation to point to.


From jam at jamux.com  Tue Jul 24 13:10:55 2007
From: jam at jamux.com (John A. Martin)
Date: Tue, 24 Jul 2007 07:10:55 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's
	message of "Tue, 24 Jul 2007 15:56:35 +0900")
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <877ioqys3k.fsf@athene.jamux.com>

A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 154 bytes
Desc: not available
Url : http://mail.python.org/pipermail/mailman-developers/attachments/20070724/a44ea72c/attachment.pgp 

From stephen at xemacs.org  Tue Jul 24 13:55:26 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 24 Jul 2007 20:55:26 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <877ioqys3k.fsf@athene.jamux.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
Message-ID: <87tzrukocx.fsf@uwakimon.sk.tsukuba.ac.jp>

John A. Martin writes:

 >     >> better to go ahead and use the message-id, rather than concoct
 >     >> yet another "this time we mean it!" unique identifier.
 > 
 >     st> That's not the point.  We're not going to impose this on
 >     st> senders;
 > 
 > I read the quote as meaning "this time we mean it really is unique",
 > imposing nothing on senders.

Ah.  If so, my reply is "if you want something done right, do it
yourself."  *All robust databases assign a unique ID to each record.*
Why shouldn't a mailing list archive do so?

 > Right.  Maybe that will encourage compliance.  The complexity of
 > catering to brokenness in this instance may be too high a price to
 > impose on all.

What complexity?  Mailman just does

   msg['X-List-Archive-Received-ID'] = Email.msgid()

(or however the message ID generator is spelled).  After that, it's up
to the archiver whether to do anything with it or not.  I proposed a
way that it could be used; if that's considered too complex, fine.
But simply assigning one is not complex or otherwise very costly.
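
A minimal sketch of that stamping step, assuming Python 3's standard
email package, where the message ID generator is spelled
email.utils.make_msgid().  The header name is the one proposed above;
the function name stamp_received_id and the "archive" idstring are
hypothetical.

```python
from email.message import EmailMessage
from email.utils import make_msgid

HEADER = "X-List-Archive-Received-ID"  # header name proposed in this thread


def stamp_received_id(msg, domain=None):
    """Add a list-server-generated ID unless the message already has one."""
    if msg[HEADER] is None:
        # make_msgid() returns a string of the form "<...@domain>".
        msg[HEADER] = make_msgid(idstring="archive", domain=domain)
    return msg[HEADER]


msg = EmailMessage()
msg["Message-ID"] = "<sender-supplied@example.com>"
first = stamp_received_id(msg, domain="lists.example.org")
```

The sender's Message-ID is left untouched; whether the archiver uses
the extra header is a separate decision.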

From jeff at jab.org  Tue Jul 24 18:31:55 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Tue, 24 Jul 2007 09:31:55 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <877ioqys3k.fsf@athene.jamux.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
Message-ID: <e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>

There are three different parties coming to the table. One is
the mail transfer agent of the sender, another is the list server,
and the third is the archive server. Ideally, all three will be happy
campers.

>So we just specify a header to put it in, and subscribers will be able
>to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the "canonical" URL
for a message. There's a good chance these archival URLs will be
served by an HTTP redirect. So let's not use the word canonical. :)

>What complexity?  Mailman just does
>
>  msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now
keep track of both the message-id and the x-list-archive-received-id.
That's two namespaces that almost do the same thing. It's easier
for the archive server to keep track of one name space than two,
and - most importantly - conceptually simpler.

>From the perspective of the assorted list servers, it's easier to
do nothing than to do something. So if they can get by with
just message-id (which is already implemented) and not have to add
x-list-archive-received-id, that's a smoother implementation path.
If we base on message-id, archival servers will be able to
retroactively add support for all their stored messages, even those
that are ten years old. And users holding an old message will be
able to figure out that URL without doing any computational
gymnastics.

Put another way, there's the possibility to reduce the archive
servers' implementation to "search for this message-id" which is
something really useful to have anyway, and therefore likely to
get wider support.

In addition, Barry was talking about concocting a unique
identifier from the Date field and Message-ID. I'm not a big fan of
this idea, because the date field comes from the mail user agent
and is often wildly corrupt, e.g., coming from 100 years in the future.
Very painful if the archive is showing most recent message first.
Therefore an archival server is very likely to determine message date
from the most recent received header (generally from a trusted mail
transfer agent) rather than the date field. From the archive server's
perspective, the best thing to do with the date field is throw it away.

So for these reasons, I'd rather stick with message-id and risk
some real world collisions, instead of introduce another identifier.
If the list server receives a message with no message-id, by all means
create one on the spot.  To me, this feels like the sweet spot in terms
of cost benefit. The main thing that bugs me is message-ids are long,
which makes them awkward to embed in a URL in the footer of a
message.

Jeff
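
The Received-header dating described above can be sketched as follows,
assuming Python 3's standard email package.  The function name
arrival_time and the sample message are hypothetical, and the sketch
assumes the topmost Received header was added by a trusted MTA.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def arrival_time(msg):
    """Return the time from the topmost (most recent) Received header,
    falling back to Date only if no Received stamp can be parsed."""
    for received in msg.get_all("Received", []):
        # The timestamp follows the last ";" in a Received header.
        _, sep, stamp = received.rpartition(";")
        if sep:
            try:
                return parsedate_to_datetime(stamp.strip())
            except (TypeError, ValueError):
                continue  # malformed stamp: try the next hop down
    date = msg.get("Date")
    return parsedate_to_datetime(date) if date else None


raw = (
    "Received: from mx.example.org by lists.example.org;\n"
    "\tTue, 24 Jul 2007 09:31:55 -0700\n"
    "Date: Fri, 24 Jul 2107 09:00:00 -0700\n"
    "Message-ID: <x@example.com>\n"
    "\n"
    "body\n"
)
when = arrival_time(message_from_string(raw))
```

Here the Date header claims the year 2107, but the value used for
sorting comes from the Received stamp.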

From Dale at Newfield.org  Tue Jul 24 18:43:29 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Tue, 24 Jul 2007 12:43:29 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>	<849198AE-DEC3-44C8-A090-470720624185@python.org>	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
Message-ID: <46A62C31.5010501@Newfield.org>

Jeff Breidenbach wrote:
> In addition, Barry was talking about concocting a unique
> identifier from the Date field and Message-ID. I'm not a big fan of
> this idea, because the date field comes from the mail user agent
> and is often wildly corrupt, e.g., coming from 100 years in the future.

Oh--I was assuming the Date to which he was referring was the current 
timestamp at which mailman was processing the message.  I was going to 
say that this guarantees uniqueness, but I guess there are parallel 
mailman implementations where more than one machine/processor are all 
serving the same list, and then two different machines/processors might 
wind up with identical timestamps while processing two different messages.

-Dale
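
A minimal sketch of how a processing timestamp could be kept unique
across parallel workers, assuming each worker can add its host name,
process ID and a per-process counter.  The function name local_id and
the dotted format are hypothetical, not anything Mailman does.

```python
import itertools
import os
import socket
import time

_counter = itertools.count()  # per-process serial number


def local_id(timestamp_ns=None, host=None, pid=None, serial=None):
    """Build an ID that stays unique even if two workers share a timestamp."""
    timestamp_ns = time.time_ns() if timestamp_ns is None else timestamp_ns
    host = socket.gethostname() if host is None else host
    pid = os.getpid() if pid is None else pid
    serial = next(_counter) if serial is None else serial
    return f"{timestamp_ns}.{host}.{pid}.{serial}"


# Same timestamp on two machines still yields two different IDs.
a = local_id(timestamp_ns=1, host="mm1", pid=10, serial=0)
b = local_id(timestamp_ns=1, host="mm2", pid=10, serial=0)
```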

From terri at zone12.com  Tue Jul 24 19:11:17 2007
From: terri at zone12.com (Terri Oda)
Date: Tue, 24 Jul 2007 13:11:17 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
Message-ID: <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>

On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
>> So we just specify a header to put it in, and subscribers will be  
>> able
>> to use it, per definition of a canonical URL.
> It is the archive server's job to decide what is the "canonical" URL
> for a message. There's a good chance these archival URLs will be
> served by an HTTP redirect. So let's not use the word canonical. :)

Someone already pointed out that the message ID is a bit long for a  
URL, so I'm guessing we're going to want some sort of shorter  
sequence number for messages for linking purposes.

Regardless of whether we *need* to generate our own unique ID, I'm  
leaning towards the thought that we're going to *want* to generate  
our own for usability reasons.  In a perfect world, I think we'd have
a sequence number so I could visit
http://example.com/mailman/archives/listname/204.html and know that
205.html would be the next message to that list, but any short unique
id would do if sequence numbers are too much of a pain.

It seems silly to generate nice short links but then use message-id.   
If we can generate nice short links, we might as well use 'em  
throughout, unless you really think the default use of the archive  
will be to search it by message-id (which I sincerely doubt, from my  
user experiences).

  Terri
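
A minimal sketch of per-list sequence numbers, assuming an in-memory
counter per list.  The class name SequenceArchive is hypothetical; the
URL layout follows the example.com example above.

```python
from collections import defaultdict
from itertools import count


class SequenceArchive:
    """Assign each list its own 1, 2, 3, ... message numbers."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._next = defaultdict(lambda: count(1))  # one counter per list
        self._by_msgid = {}  # (listname, message-id) -> sequence number

    def add(self, listname, message_id):
        key = (listname, message_id)
        if key not in self._by_msgid:
            self._by_msgid[key] = next(self._next[listname])
        return self.url(listname, self._by_msgid[key])

    def url(self, listname, seqno):
        return f"{self.base_url}/{listname}/{seqno}.html"


archive = SequenceArchive("http://example.com/mailman/archives")
first = archive.add("listname", "<a@example.com>")
second = archive.add("listname", "<b@example.com>")
```

The catch is that the numbers only stay stable if the mapping is
persisted and never reassigned when the archive is rebuilt.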


From jeff at jab.org  Tue Jul 24 20:03:09 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Tue, 24 Jul 2007 11:03:09 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
Message-ID: <e03b90ae0707241103qaba04b5hc72d2ee9d2d66f1@mail.gmail.com>

> Regardless of whether we *need* to generate our own unique ID, I'm
> leaning towards the thought that we're going to *want* to generate
> our own for usability reasons.  In a perfect world, I think we'd have
> a sequence number so I could visit
> http://example.com/mailman/archives/listname/204.html and know that
> 205.html would be the next message to that list, but any short unique
> id would do if sequence numbers are too much of a pain.

I agree there's a lot of usability benefits from short URLs, but perhaps
this is the job of the archive server, and not the list server. Mharc (an
archive server) is a great example here. Mharc's canonical message
format is pretty human friendly.

http://www.mhonarc.org/archive/html/mharc-users/2002-08/msg00000.html

Unfortunately, there's no trivial way for the list server to know that human
friendly URL when the message is sent out. Fortunately, Mharc is also
happy to handle messages by message-id, which the list server does know
about.

http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users&i=200208010532.g715W0e31774@gator.earlhood.com

Had I been the implementer, I'd probably have made mharc do an HTTP 302
redirect from the longer URL to the shorter URL. But that's beside the point.
The point is we have an existing, working, happy archival server, and it would
be really nice if list servers (such as mailman) were compatible. And by
compatible, I mean offering the capability of embedding an archival URL in the
footers of messages.

-Jeff
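
The redirect idea above can be sketched as a lookup from the
message-id URL to the short URL, answered with an HTTP 302.  The query
parameter names a and i are taken from the Mharc URL above; the index
contents, the example paths and the function name resolve are
hypothetical, and this is not how Mharc itself is implemented.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical index: (archive name, message-id) -> short path
SHORT_PATHS = {
    ("listname", "abc@example.com"):
        "/archive/html/listname/2007-07/msg00042.html",
}


def resolve(url):
    """Return (status, headers) for a message-id lookup URL."""
    query = parse_qs(urlsplit(url).query)
    key = (query.get("a", [""])[0], query.get("i", [""])[0])
    path = SHORT_PATHS.get(key)
    if path is None:
        return 404, {}
    return 302, {"Location": path}


status, headers = resolve(
    "http://archive.example.org/cgi-bin/mesg.cgi?a=listname&i=abc%40example.com"
)
```

With this shape the list server only needs to know the message-id,
and the archive server stays free to choose its own short URLs.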

From barry at python.org  Tue Jul 24 20:41:38 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 24 Jul 2007 14:41:38 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<1183651769.10813.6.camel@finch.boston.redhat.com>
	<C72813CC-AD17-4486-A8C8-D237C7E75D77@python.org>
	<74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com>
Message-ID: <EBBA2840-4253-4A4C-A5DD-DBAFB8BC1025@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:

> On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
>> I've looked at a few lurker archivers and I wasn't blown away by its
>> user interface.  That's apparently highly configurable though.
>
> I've been doing a lot of thinking about interface, and I'm coming to
> the conclusion that something more like a web bulletin board is
> probably the way to go, given that people use them all the time
> without much trouble and with a fairly minimal amount of whining. ;)

I like this for several reasons.  I've long wanted a bridge between  
the traditional mailing list and a forum because to me they're  
related along a spectrum of emotional investment.

What I mean is this.  For the subjects and projects I care deeply  
about, I join the mailing list.  I want to be intimately involved in  
the day-to-day collaboration that being subscribed gives me.  I care  
enough about that that I'm willing to put up with the pain that comes  
along with mailing lists, such as the overhead for subscribing,  
deleting topics I don't care about, the occasional spam, the overhead  
of going on vacation or leaving the list, etc.

But there are even more topics or projects that I have only a  
fleeting interest in.  Say I find a bug in some X program, or wake up  
and decide to learn how to use setuptools, or find that some recent  
update broke my Linux server.  In all those cases, I might want to  
start a thread of discussion or ask a question, and be very involved  
in that thread for a week or two.  Then, my interest wanes, or I get  
my question answered, or other projects pique my interest.  Mailing  
lists are pretty bad at managing those kinds of fleeting involvement,  
but forums are quite nice.  There's usually fairly low overhead (and  
probably even less if OpenID and such were in widespread adoption)  
for joining, and when I lose interest the forum doesn't fill up my  
inbox.  OTOH, forums seem good for short 'instant' messages, but not  
so good (IMO) for free ranging, detailed discussions.  So there's a  
spectrum.

> I'm trying to use interfaces to things like comment systems (which
> are often threaded -- picture the slashdot stuff, maybe?) and popular
> boards like phpbb (which isn't threaded beyond separate topics) as
> guides to how people usually deal with conversations on the web.
>
> It'd actually be fairly easy, at that point, to just put a posting
> interface into the archives (yes, you'd have to be logged in, and
> yes, this means your password becomes that bit more valuable because
> someone having it can pose as you to the list... but they could do
> that by spoofing your email address so I'm not too concerned). But
> then people who don't like email or just want to pop by and check the
> list quickly could actually use mailman like a web board, which is
> something I'm pretty sure would get used (I know my users have asked
> for it in the past).

Heck, /I'd/ use it, so what more justification do we need? :)

> I've been drafting simple prototype interfaces in my head, trying to
> keep potential architectures in mind.  I'm hoping I'll have time this
> week to code up some HTML and see how well they actually work when
> they're not just inside my head. :)

I'd love to see the prototypes once you've committed them to HTML.   
The one important thing is that the individual postings will need the  
equivalent of a stable archive URL (i.e. permalink) that could be  
passed around, added to web pages, etc.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqZH43EjvBPtnXfVAQLrzQP8CG5ALhX+Wk91I+jri20R60C7cqtCzQby
V9MD8FlhC/7LbRW3QXwJnwWSpXCnBYhShxmRMn2maEeIXqPUEBl3QOcUYkHxeRZG
zV6sKE1J1EZfbUTY7CM3lcnOZKHB1n07PGslcxQsJHEmnbuHbR7bm+2AV2CknzZj
8Y/9XxPjX5Q=
=IRq2
-----END PGP SIGNATURE-----

From barry at python.org  Tue Jul 24 20:44:27 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 24 Jul 2007 14:44:27 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
Message-ID: <C051DAD7-B7FF-41A2-9A28-695930959145@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:

> Which brings me to suggestion #2, which is go ahead and write
> an RFC on how list servers should embed archival links in messages.
> This sounds like an internet wide interoperability issue as much as
> something mailman specific. Why not come up with a scheme usable
> by all list servers? And also describe a specification third party  
> archival
> services can comply to. Besides, I've always wanted to help write
> an RFC. If we go that route, it would be good to get input from a  
> range
> of people - one person I'd suggest is Earl Hood, author of mhonarc.

I've always thought that an RFC-like spec that describes how a  
generic mailing list manager would interoperate with a generic  
archiving service is the way to go.  I've written up a somewhat more  
formal spec of what I've currently implemented in MM3 here:

http://wiki.list.org/display/DEV/Stable+URLs

If this looks good, I'd be happy to approach some of the related  
communities to try to get buy-in.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqZIjHEjvBPtnXfVAQLK9AP/VQveYtFuZhJam9TITYBuMyc8pig7nqDt
efn4DIXhZhgtqBQ58/TgEFZnTkKfiZ1HLdoovrQye8HdKZmuAd+SJrOkq/aO9fIC
ZgaV5HYBD7TcnQuO2z5eRuK3IY7FpWoeZrn/a6sxBObsaSOrOTjhqs1gv5go24d3
8CmG/bB9LTo=
=EyoU
-----END PGP SIGNATURE-----

From barry at python.org  Tue Jul 24 20:53:25 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 24 Jul 2007 14:53:25 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <419AFBBF-82FF-4939-B85B-85055A1B8482@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:

> I simply think we should be prepared for applications where relying on
> the sender to supply a UUID is not acceptable; we need to be able to
> provide one ourselves.  Creating UUIDs is a solved problem, after all.
> So we just specify a header to put it in, and subscribers will be able
> to use it, per definition of a canonical URL.
>
> Then we say that an archive SHOULD provide access to the resource via
> Message-ID if available, and define how to construct that URL from the
> List-Archive and Message-ID headers.

I think there are two approaches we could argue for.  One is for the  
mailing list manager to craft a UUID out of whole cloth and stick  
that in a header.  Then any downstream archiver would be obliged to  
use that header value as the canonical address of the message, with  
an alternative path to the message via the Message-ID (possibly  
returning a list of matching messages when there are collisions).

The second approach, and the one that I favor, is to use the
Message-ID (and the Date) header on the original message as the UUID,
properly handling corner cases like duplicate or missing headers.
This UUID serves as the basis for the address to the message
resource, just like above.

I like the second approach better because in the case where you start  
with an off-list copy of the message, you have a decent enough chance  
of getting to the archived message, or at least to a resource  
containing a link to the message.  The first alternative would  
require access to the list copy.

Imagine if every archiver supported my proposal: knowing just the
Message-ID and Date header, you could get to that message from almost
anywhere, just by using the UUID as a relative URL rooted at, say,
http://www.mail-archive.com, http://groups.google.com,
http://mail.python.org/pipermail, or whatever.  That would be pretty neat.
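
A minimal sketch of deriving such an identifier with Python's hashlib
and base64 modules.  How the two header values are combined before
hashing is an assumption here (plain concatenation after stripping
whitespace); the actual recipe is whatever the wiki spec defines.  The
names stable_id and archive_url are hypothetical.

```python
import base64
import hashlib


def stable_id(message_id, date):
    """base32(sha1(...)) over the sender's Message-ID and Date headers."""
    # Assumption: the values are concatenated after stripping whitespace.
    data = (message_id.strip() + date.strip()).encode("utf-8")
    digest = hashlib.sha1(data).digest()  # 20 bytes
    # 20 bytes encode to exactly 32 base32 characters, with no padding.
    return base64.b32encode(digest).decode("ascii")


def archive_url(root, message_id, date):
    """Use the identifier as a relative URL under any archive root."""
    return root.rstrip("/") + "/" + stable_id(message_id, date)


uid = stable_id("<x@example.com>", "Tue, 24 Jul 2007 14:53:25 -0400")
url = archive_url("http://mail.python.org/pipermail",
                  "<x@example.com>", "Tue, 24 Jul 2007 14:53:25 -0400")
```

Because the inputs come from the original message, anyone holding an
off-list copy can compute the same identifier.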

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqZKpnEjvBPtnXfVAQJWcwP6A6SqHTeft+c/5IeSpRsI+gvtPJW94fcG
pjB66oYiKco7U+rZtxll3TPD9Ta7gccohq72sh8hV7CHRW7Cd531Hq91z7QktHUW
zqzxkMimoca7WlUxr0/ElyPNhRkjMlR8LvhNCjs4a9O6/PpzBTNjsXwaTKfLrqO3
N5iq3BWoMK8=
=fSNC
-----END PGP SIGNATURE-----

From barry at python.org  Tue Jul 24 21:11:27 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 24 Jul 2007 15:11:27 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
Message-ID: <FF2E18AA-8137-4509-9780-B4CCA77CAE82@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:

>> What complexity?  Mailman just does
>>
>>  msg['X-List-Archive-Received-ID'] = Email.msgid()
>
> Easy to introduce, harder to deal with. The archival server would now
> keep track of both the message-id and the x-list-archive-received-id.
> That's two namespaces that almost do the same thing. It's easier
> for the archive server to keep track of one name space than two,
> and - most importantly - conceptually simpler.

True, but an archiver already has to handle collisions on the
Message-ID, so in a sense you have to maintain multiple paths to the
same message, don't you?

So I like my proposal because it imposes nothing additional on the
MUA or MTA, a tiny bit more on the MLM, and some extra work (though I
think not much) on the archiving agent.  What you gain from my
proposal over a pure Message-ID approach is guaranteed uniqueness
given the list copy, and human-friendlier URLs.

> From the perspective of the assorted list servers, it's easier to
> do nothing than to do something. So if they can get by with
> just message-id (which is already implemented) and not have to add
> x-list-archive-received-id, that's a smoother implementation path.
> If we base on message-id, archival servers will be able to
> retroactively add support for all their stored messages, even those
> that are ten years old. And users holding an old message will be
> able to figure out that URL without doing any computational
> gymnastics.

All these are still true with my proposal, with the caveat, as
Stephen points out, that given a URL based on sender-provided
headers, you must be prepared to deal with collisions, so sometimes
your resources will return lists.  The advantage of adding a bit of
MLM-provided information is that given the list copy you can
guarantee uniqueness, and given the off-list copy you can get to a
resource that contains a link to the message you want.
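
A minimal sketch of a resource that returns a list on collision,
assuming the MLM-provided information is a small sequence number
appended to the hash.  Numbering per hash, as done here, is an
assumption, and the class name HashIndex is hypothetical.

```python
from collections import defaultdict


class HashIndex:
    """Map a sender-derived hash to every message that produced it."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    def add(self, hashed_id, message):
        """Store a message; return its unique (hash, seqno) address."""
        bucket = self._by_hash[hashed_id]
        bucket.append(message)
        return hashed_id, len(bucket)

    def get(self, hashed_id, seqno=None):
        bucket = self._by_hash.get(hashed_id, [])
        if seqno is not None:
            return bucket[seqno - 1]  # list copy: exactly one message
        return list(bucket)  # off-list copy: every candidate


index = HashIndex()
addr1 = index.add("HASH", "first message")
addr2 = index.add("HASH", "colliding message")
```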

> Put another way, there's the possibility to reduce the archive
> servers' implementation to "search for this message-id" which is
> something really useful to have anyway, and therefore likely to
> get wider support.
>
> In addition, Barry was talking about concocting a unique
> identifier from the Date field and Message-ID. I'm not a big fan of
> this idea, because the date field comes from the mail user agent
> and is often wildly corrupt, e.g., coming from 100 years in the future.
> Very painful if the archive is showing most recent message first.
> Therefore an archival server is very likely to determine message date
> from the most recent received header (generally from a trusted mail
> transfer agent) rather than the date field. From the archive server's
> perspective, the best thing to do with the date field is throw it  
> away.

Throw it away or hide it?  The former would be a problem, but not the  
latter.  Does your archiver keep a canonical copy of the message as  
you received it?  If so, then you preserve the original Date header
well enough for the calculation to occur, even if you hide the Date
header, or display a Received header date when you render it to
HTML.  How you render it doesn't matter, of course.

But I should point out that I'm not married to including the Date  
header in the hash.  I like it because it appears to reduce  
collisions which I care about.  But I still like using the base32  
sha1 hash instead of the raw Message-ID because I think it's easier  
for humans to use, read, speak, and copy.  Of course this doesn't  
mean that you need to disable your search-by-Message-ID feature!

> So for these reasons, I'd rather stick with message-id and risk
> some real world collisions, instead of introduce another identifier.
> If the list server receives a message with no message-id, by all means
> create one on the spot.  To me, this feels like the sweet spot in  
> terms
> of cost benefit. The main thing that bugs me is message-ids are long,
> which makes them awkward to embed in a URL in the footer of a
> message.

Another advantage for the URL scheme I propose.  You know you're
going to end up with URLs of length
len(host-prefix) + 32 + 1 + #digits-in-seqno, where

(32 == len(base32(sha1digest(data))))
(1 == the "/" divider)
(#digits-in-seqno == e.g. len(str(seqno)))

You should be able to keep things in the 60-70 character range,  
including the host name.  That doesn't seem too bad.
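
The length arithmetic above can be checked with a short sketch.  The
host prefix lists.example.org and the function names are hypothetical;
the 32 is the length of a base32-encoded SHA-1 digest.

```python
def url_length(host_prefix, seqno, hash_chars=32):
    """len(host-prefix) + 32 + 1 + #digits-in-seqno."""
    return len(host_prefix) + hash_chars + 1 + len(str(seqno))


def build_url(host_prefix, hashed_id, seqno):
    """Assemble host-prefix + hash + "/" + seqno."""
    return f"{host_prefix}{hashed_id}/{seqno}"


prefix = "http://lists.example.org/"  # 25 characters
placeholder_hash = "A" * 32  # stands in for a real base32 digest
url = build_url(prefix, placeholder_hash, 7)
```

With this 25-character prefix and a one-digit sequence number the URL
comes to 25 + 32 + 1 + 1 = 59 characters.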

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqZO4HEjvBPtnXfVAQIYGwP/VZPCiQrg9CTeMThApNTh7xUismbW0AiT
1N6a8DusXDBrqiLDQd+v2/R5KOV+TnwDNlIcl5FfFatHxWJ0bGy850kT/nhrHdKU
UrW0hR8PWSMIRN5Bqx9bL9cvaMigAoyX+njAfiDgl0yy7arbAm66GH1HNH3c1XGT
1/qaGckINUg=
=4uwH
-----END PGP SIGNATURE-----

From turnbull at sk.tsukuba.ac.jp  Wed Jul 25 05:04:23 2007
From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull)
Date: Wed, 25 Jul 2007 12:04:23 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
Message-ID: <87hcntkwug.fsf@uwakimon.sk.tsukuba.ac.jp>

Jeff Breidenbach writes:

 > >So we just specify a header to put it in, and subscribers will be able
 > >to use it, per definition of a canonical URL.
 > 
 > It is the archive server's job to decide what is the "canonical" URL
 > for a message. There's a good chance these archival URLs will be
 > served by an HTTP redirect. So let's not use the word canonical. :)

If it's not going to be "canonical" (I forget if there's a standard
for that word :), what is the point in writing an RFC?

 > >What complexity?  Mailman just does
 > >
 > >  msg['X-List-Archive-Received-ID'] = Email.msgid()
 > 
 > Easy to introduce, harder to deal with. The archival server would now
 > keep track of both the message-id and the x-list-archive-received-id.
 > That's two namespaces that almost do the same thing.

The implementations are similar, and there is "nearly" a one-to-one
correspondence.  But the semantics are very different.  Message-ID is
untrustworthy, the internal ID is trustworthy.

 > So for these reasons, I'd rather stick with message-id and risk
 > some real world collisions, instead of introduce another identifier.

Go ahead and stick with message-id if *you* like, but please don't
tell *me* what risks I have to accept.

There needs to be a way to *enforce* uniqueness, and it *must* be
specified by the RFC in order for archive implementations to be
interoperable.  Note that word "specify"; I do not insist that this
level of robustness be *required*.  But if we don't specify it now,
people who want such robustness will have to do all this work again,
and possibly will end up with something that some servers conforming
to "your" RFC will not conform to.

It is possible that most archivers will simply use the message ID, and
do something brutal in the rare case of a collision.  That's fine.
But an archiver that wants to provide a canonical URL which is
guaranteed to uniquely and losslessly identify a post in its archive
should have a standard way to do that.

 > The main thing that bugs me is message-ids are long, which makes
 > them awkward to embed in a URL in the footer of a message.

The footer URL is of no concern in this discussion.  There is not
going to be a requirement that footer URLs be "canonical", not if I
have any say in the matter.  The "canonical" URL will be in (or be
constructed from) the message header.
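
A minimal sketch of an archive that enforces uniqueness with its own
internal ID while still offering lookup by Message-ID.  It assumes
Python's uuid module for the internal ID; the class and method names
are hypothetical.

```python
import uuid
from collections import defaultdict


class Archive:
    def __init__(self):
        self._messages = {}  # internal ID -> message text
        self._by_message_id = defaultdict(set)  # Message-ID -> internal IDs

    def store(self, message_id, text):
        internal_id = uuid.uuid4().hex  # generated here, so trustworthy
        self._messages[internal_id] = text
        if message_id:  # the sender may have omitted or reused it
            self._by_message_id[message_id].add(internal_id)
        return internal_id

    def by_internal_id(self, internal_id):
        """Canonical lookup: always exactly one message."""
        return self._messages[internal_id]

    def by_message_id(self, message_id):
        """Convenience lookup: zero, one, or several internal IDs."""
        return sorted(self._by_message_id.get(message_id, ()))


archive = Archive()
one = archive.store("<dup@example.com>", "original post")
two = archive.store("<dup@example.com>", "different post, same Message-ID")
```

An archiver that does not care about this level of robustness can
ignore the internal ID and serve only the Message-ID path.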


From jeff at jab.org  Wed Jul 25 06:47:23 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Tue, 24 Jul 2007 21:47:23 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <FF2E18AA-8137-4509-9780-B4CCA77CAE82@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<FF2E18AA-8137-4509-9780-B4CCA77CAE82@python.org>
Message-ID: <e03b90ae0707242147v6a04f50dkf35a11b67f7ac63f@mail.gmail.com>

> What you gain from my proposal over a pure Message-ID approach
> is guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two
messages with the same message-id, same date, but different bodies.
Sometimes the channel between the MLM and the archive server will
be SMTP, and spurious messages can be injected. Finally, from the archive
server's perspective, some of the MLMs might make mistakes - just like
from the MLM's perspective, some of the MTAs might make mistakes in
setting message-id. So I don't think the proposed SHA1(date, message-id)
scheme buys a hard guarantee of uniqueness. Every component has
to protect itself, but none can solve the world's problems.

So that moves us to how many collisions are reduced in practice.
I have a question about the numbers Barry mined from the python
lists. Are the collisions really that high? One should not count
messages without a message-id, because the MLM can and should
create one in that case.

One should also not count collisions of messages going to different
lists. Here's why. Let's say message M is cross posted to lists L1 and
L2. Even though it is the same message, there are now two different
contexts. (For example, people visiting M at archive L1 should get a
completely different experience when they hit "next message" than
people visiting M at archive L2.)

So I'd be curious what the collision numbers come to with these two
factors taken into account. The other takeaway is that the list name
really should be part of the URL to get proper context. The earlier
example from Mharc does this.
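
A minimal sketch of counting collisions with both factors taken into
account, assuming the archive can be read as (listname, message-id)
pairs.  The function name count_collisions and the sample data are
hypothetical.

```python
from collections import Counter


def count_collisions(records):
    """Count repeated Message-IDs within a list.

    records: iterable of (listname, message_id); message_id may be None.
    """
    seen = Counter(
        (listname, message_id)
        for listname, message_id in records
        if message_id  # skip messages that arrived with no Message-ID
    )
    # Each extra occurrence within the same list is one collision.
    return sum(n - 1 for n in seen.values() if n > 1)


records = [
    ("L1", "<a@example.com>"),
    ("L2", "<a@example.com>"),  # cross-post: not counted
    ("L1", "<a@example.com>"),  # same list, same ID: counted
    ("L1", None),
    ("L1", None),  # missing IDs: not counted
]
collisions = count_collisions(records)
```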

> and human friendlier urls.

That's a very compelling point.

SHA1 can't be computed inside someone's head or simply cut-n-pasted
together for old messages, but I think the usability benefits of short
URLs (short enough that they can comfortably fit inside message bodies)
outweigh this drawback. By the way, is SHA-1 still in favor? My
impression was it was fading away after the Shandong University team
partially cracked it.

> Throw it away or hide [Date]?  The former would be a problem,
> but not the latter.

Thrown away. My favorite archival service is based on mhonarc,
and raw mail goes into offline cold storage. Of course this can be
changed for future messages with some pain, but there's no
reasonable way for me (or any other mhonarc users in the
same predicament) to retrofit against Date-based URLs. For the
record, here's what mhonarc embeds in each HTML page it
produces because these were considered the important headers.
In this message sent from Australia, the date shows a timezone
of UTC -0700, because it was pulled from the received header.

<!-- MHonArc v2.6.15 -->
<!--X-Subject: [Gossip] Re: green&#45;travel resources {webliographies} -->
<!--X-From-R13: "[nephf Z. Saqvpbgg" <zraqvpbgNlnubb.pbz> -->
<!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->
<!--X-Message-Id: 20060426072529.45761.qmail at web54507.mail.yahoo.com -->
<!--X-Content-Type: text/plain -->
<!--X-Reference: e03b90ae0604242000q70a81fcete7da4965c581c838 at mail.gmail.com -->
<!--X-Head-End-->
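
A minimal sketch of recovering those embedded headers from an existing
page, which is the information a retrofit would have to work from.  It
assumes the comment layout shown above; the function name
mhonarc_headers and the sample message-id are hypothetical.

```python
import html
import re

HEADER_RE = re.compile(r"<!--X-([A-Za-z0-9-]+): (.*?) -->")


def mhonarc_headers(page):
    """Return {name: value} for the X- comments before X-Head-End."""
    head, _, _ = page.partition("<!--X-Head-End-->")
    # A repeated header name keeps only its last value in this sketch.
    return {
        name: html.unescape(value)  # e.g. "&#45;" back to "-"
        for name, value in HEADER_RE.findall(head)
    }


page = (
    "<!-- MHonArc v2.6.15 -->\n"
    "<!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->\n"
    "<!--X-Message-Id: abc@example.com -->\n"
    "<!--X-Head-End-->\n"
    "<!--X-Body: ignored -->\n"
)
headers = mhonarc_headers(page)
```

The Message-Id survives, so a message-id based URL can be retrofitted;
the sender's original Date header does not.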

So my main request is to double check the numbers, see if using
"Date" really buys as much as one thinks. I'll keep digesting the
other aspects of the wiki page.

From barry at python.org  Wed Jul 25 15:06:32 2007
From: barry at python.org (Barry Warsaw)
Date: Wed, 25 Jul 2007 09:06:32 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
Message-ID: <F0935B34-0264-430E-B59D-4B8ADEA31DB9@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:

> On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:
>>> So we just specify a header to put it in, and subscribers will be
>>> able
>>> to use it, per definition of a canonical URL.
>> It is the archive server's job to decide what is the "canonical" URL
>> for a message. There's a good chance these archival URLs will be
>> served by an HTTP redirect. So let's not use the word canonical. :)
>
> Someone already pointed out that the message ID is a bit long for a
> URL, so I'm guessing we're going to want some sort of shorter
> sequence number for messages for linking purposes.

Yes, definitely.  What do you think of the base32 examples I have on  
the wiki page?

> Regardless of whether we *need* to generate our own unique ID, I'm
> leaning towards the thought that we're going to *want* to generate
> our own for usability reasons.  In a perfect world, I think we'd have
> a sequence number so I could visit
> http://example.com/mailman/archives/listname/204.html and know that
> 205.html would be the next message to that list, but any short unique
> id would do if sequence numbers are too much of a pain.
>
> It seems silly to generate nice short links but then use message-id.
> If we can generate nice short links, we might as well use 'em
> throughout, unless you really think the default use of the archive
> will be to search it by messageid (which I sincerely doubt, from my
> user experiences).

We'd want sequence numbers in the urls if we think people will hand  
edit them, say in a browser location bar.  I'm not sure that's a  
common enough use case.

Pipermail currently uses sequence numbers but there are big problems  
with that.  First, the mbox'ing algorithm wasn't always correct so  
while sequence numbers were accurate when generating the html  
archives on the fly, they broke horribly when you tried to regenerate
them from an mbox file.  It's also why we have tools like cleanarch,
which tries to unbreak earlier mboxing bugs by crufty heuristics.
This /might/ be solved by ditching mboxes for maildir or some other  
canonical raw archiving format (not a bad idea in its own right), but  
manual surgery on the raw archives could still break it.  Sometimes  
site admins just /have/ to remove messages, disrupting the sequencing.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdK2XEjvBPtnXfVAQKfDQP/ToPZ3t7+uIyMrsThOr+PVQ7aKVT/BQ7F
OgKqFSDSma4ZofQOkPgr4ZFRT1yKRURWas7jI2zQ8ADPAOKCYh0Udgq6XjpOI8mI
7/pODazVkbwzT9Oo06pGwpzaONK4eZjt1y9IDb9VkniUcAyve5EQ+5+KaG3rbo4M
wsrCnHLkvSE=
=/z/f
-----END PGP SIGNATURE-----

From barry at python.org  Wed Jul 25 15:10:37 2007
From: barry at python.org (Barry Warsaw)
Date: Wed, 25 Jul 2007 09:10:37 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707241103qaba04b5hc72d2ee9d2d66f1@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
	<e03b90ae0707241103qaba04b5hc72d2ee9d2d66f1@mail.gmail.com>
Message-ID: <128BEC04-DFA5-4D9E-B813-8091FE3DEE94@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote:

>> Regardless of whether we *need* to generate our own unique ID, I'm
>> leaning towards the thought that we're going to *want* to generate
>> our own for usability reasons.  In a perfect world, I think we'd have
>> a sequence number so I could visit
>> http://example.com/mailman/archives/listname/204.html and know that
>> 205.html would be the next message to that list, but any short unique
>> id would do if sequence numbers are too much of a pain.
>
> I agree there are a lot of usability benefits from short URLs, but
> perhaps this is the job of the archive server, and not the list
> server. Mharc (an archive server) is a great example here. Mharc's
> canonical message format is pretty human friendly.
>
> http://www.mhonarc.org/archive/html/mharc-users/2002-08/msg00000.html
>
> Unfortunately, there's no trivial way for the list server to know
> that human friendly URL when the message is sent out. Fortunately,
> Mharc also happily handles messages by message-id, which the list
> server does know about.
>
> http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users&i=200208010532.g715W0e31774 at gator.earlhood.com
>
> Had I been the implementer, I'd probably have made mharc do an HTTP
> 302 redirect from the longer URL to the shorter URL. But that's beside
> the point. The point is we have an existing, working, happy archival
> server, and it would be really nice if list servers (such as mailman)
> were compatible. And by compatible, I mean offering the capability of
> embedding an archival URL in the footers of messages.

I agree, I just don't think message-ids are user friendly enough to  
be this canonical url.  Especially in this context, which is exactly  
where urls are thrown in users' faces.  An archiving service is
exactly the right place for redirecting human readable urls to the  
archiver's canonical url (by, I agree, 302).

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdLznEjvBPtnXfVAQJtxgQAiLp7TjnLoOLnpoxfli2gBo6fdU6ZIFb0
SKiuRgLAoTSdnJymYWOww2U/vTJ3HqR2dZNFCfGeVHgzoHpiX87WiZDJ4Sx1Jec8
7BpIO1ZokGI2NhHiSscYC5k4iCzce17lVGkyVzfYlFysmFKsFjcDIpV8wQFleeG9
TneLaMXT2eY=
=1tKI
-----END PGP SIGNATURE-----

From barry at python.org  Wed Jul 25 15:17:13 2007
From: barry at python.org (Barry Warsaw)
Date: Wed, 25 Jul 2007 09:17:13 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87hcntkwug.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<87hcntkwug.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <8530573B-F935-43A8-AD22-CAEC776807D4@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 24, 2007, at 11:04 PM, Stephen J. Turnbull wrote:

>>> So we just specify a header to put it in, and subscribers will be  
>>> able
>>> to use it, per definition of a canonical URL.
>>
>> It is the archive server's job to decide what is the "canonical" URL
>> for a message. There's a good chance these archival URLs will be
>> served by an HTTP redirect. So let's not use the word canonical. :)
>
> If it's not going to be "canonical" (I forget if there's a standard
> for that word :), what is the point in writing an RFC?

I completely agree.  Maybe "interoperable" is the right word to use.   
Or "user friendly interoperable archive url" which is really what  
we're trying to define here (IMO).

> There needs to be a way to *enforce* uniqueness, and it *must* be
> specified by the RFC in order for archive implementations to be
> interoperable.  Note that word "specify"; I do not insist that this
> level of robustness be *required*.  But if we don't specify it now,
> people who want such robustness will have to do all this work again,
> and possibly will end up with something that some servers conforming
> to "your" RFC will not conform to.

Yep.

> It is possible that most archivers will simply use the message ID, and
> do something brutal in the rare case of a collision.  That's fine.
> But an archiver that wants to provide a canonical URL which is
> guaranteed to uniquely and losslessly identify a post in its archive
> should have a standard way to do that.

Yep.

>> The main thing that bugs me is message-ids are long, which makes
>> them awkward to embed in a URL in the footer of a message.
>
> The footer URL is of no concern in this discussion.  There is not
> going to be a requirement that footer URLs be "canonical", not if I
> have any say in the matter.  The "canonical" URL will be in (or be
> constructed from) the message header.

Agreed in the sense that the RFC 2822 headers must contain all the  
information necessary to construct the canonical url (or must contain  
the canonical url).  A list server /can/ decorate the message with  
the url in other ways, but that certainly isn't necessary.

You might even imagine a mail reader extension that read the  
appropriate List-* headers and added a button "View In Archive" which  
sent the canonical url to your web browser.  Once that happens, the  
archive service is free to redirect to its heart's content.  I submit
though that any good archive service (and certainly Pipermail++ if I  
can help it) will ensure that those urls are stable forever,  
otherwise people will stop relying on it.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdNWnEjvBPtnXfVAQIZRAP/Ux9rUK6ToH5Zl2XTC8LOKgCG+1yhf4pw
h4XVZc0nmP1xxFttsXzsuY+/oGFW8yrY0yGnxK4N5EKUEpIxejGNbVtAjpQ5l/Sy
ml5R5kDhZtk/d8tE9IXOzB5zCcxdmMgjX3KfL78t5L6JzAQ4RgM0MTYxPH69AdHW
zpvhBCow/z8=
=KiqU
-----END PGP SIGNATURE-----

From barry at python.org  Wed Jul 25 15:34:13 2007
From: barry at python.org (Barry Warsaw)
Date: Wed, 25 Jul 2007 09:34:13 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707242147v6a04f50dkf35a11b67f7ac63f@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<FF2E18AA-8137-4509-9780-B4CCA77CAE82@python.org>
	<e03b90ae0707242147v6a04f50dkf35a11b67f7ac63f@mail.gmail.com>
Message-ID: <2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:

>> What you gain from my proposal over a pure Message-ID approach
>> is guaranteed uniqueness given the list copy
>
> Guarantee is a pretty strong word. A malicious person could post two
> messages with the same message-id, same date, but different bodies.

No question, if the archive service and the list server are not  
intimately connected, the communication channel between the two can  
be subverted.  There are ways that channel's trust could be enhanced  
though, for example by the list server signing its headers in a dkim- 
like fashion.

But in situations where the two are co-located, you can trust these  
headers even without that enhancement.

> So that moves us to how many collisions are reduced in practice.
> I have a question about the numbers Barry mined from the python
> lists. Are the collisions really that high? One should not count
> messages without a message-id, because the MLM can and should
> create one in that case.

I've uploaded the script I used to here:

http://wiki.list.org/download/attachments/786633/scan.py?version=1

It's probably not perfect, and certainly the python.org mboxes may
not be representative enough of the real world.  Please grab the  
script, tweak it and run it over your own raw archives; it should be  
easily modified to handle any of the mailbox formats supported by  
Python 2.5's mailbox module.

If you improve the script or find numbers that lead to different  
conclusions, now's the time to know!
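
For readers without the wiki attachment to hand, the kind of scan being asked for can be sketched roughly as follows. This is not the scan.py script referenced above, only a hedged illustration written for current Python 3; the function name `count_repeated_message_ids` is hypothetical. It takes any iterable of message objects, so it could be fed `mailbox.mbox(path)` or one of the other mailbox classes.

```python
from collections import Counter


def count_repeated_message_ids(messages):
    """Return (total, missing, repeats) for an iterable of messages.

    Messages with no Message-ID are counted separately and are not
    treated as collisions with each other.
    """
    seen = Counter()
    total = missing = 0
    for msg in messages:
        total += 1
        msgid = str(msg.get("Message-ID") or "").strip()
        if not msgid:
            missing += 1
            continue
        seen[msgid] += 1
    # Every copy after the first one counts as a repeat.
    repeats = sum(n - 1 for n in seen.values())
    return total, missing, repeats

# Example use (path is a placeholder):
#   import mailbox
#   print(count_repeated_message_ids(mailbox.mbox("/path/to/list.mbox")))
```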

>> and human friendlier urls.
>
> That's a very compelling point.
>
> SHA1 can't be computed inside someone's head or simply cut-n-pasted
> together for old messages, but I think the usability benefits of
> short URLs (short enough that they can comfortably fit inside message
> bodies) outweigh this drawback. By the way, is SHA-1 still in favor?
> My impression was it was fading away after the Shandong University
> team partially cracked it.

We're not concerned with the cryptographic security claims of SHA1.   
I don't see any economically beneficial attack on the archives  
against SHA1 here.  I think SHA1 is reasonably universally available,  
and marginally better than MD5, so it's probably good enough for this  
application.

You're right that no one is going to do SHA1 in their heads, and if  
they could, they're probably working for some TLA in a secret gubmit  
basement lab somewhere.  The point of course is that a /program/  
could easily apply the algorithm to a very minimal existing message  
and come up with the same canonical url.  This enables all kinds of  
cool applications based on REST-y principles or whatever.  The fact  
that the algorithm leads to short(ish), largely unambiguous (to  
humans), readable urls is an important benefit -- probably /the/ most  
important benefit.
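
To make "a program could easily apply the algorithm" concrete, here is a minimal Python sketch of the general idea: hash a couple of header values with SHA1 and base32-encode the digest. The exact inputs and how they are combined are assumptions made for illustration (the wiki page's precise recipe is not reproduced in this thread), and `stable_id` is a hypothetical name.

```python
import base64
import hashlib


def stable_id(message_id, date):
    """Base32-encode the SHA1 of a message's Message-ID and Date values."""
    # Assumption: the two header values are stripped of surrounding
    # whitespace and simply concatenated before hashing.
    material = (message_id.strip() + date.strip()).encode("utf-8")
    digest = hashlib.sha1(material).digest()
    # A SHA1 digest is 20 bytes, which base32 encodes to exactly
    # 32 characters with no "=" padding.
    return base64.b32encode(digest).decode("ascii")
```

Because the result depends only on headers that travel with the message, anyone holding a copy of the message can recompute the same identifier without asking the archive.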

>> Throw it away or hide [Date]?  The former would be a problem,
>> but not the latter.
>
> Thrown away.

Really?  Wow.  I'd have thought every archiving service would want to  
keep a record of the raw message it received on the wire.  That would  
allow it to regenerate the html archive if necessary, provide useful  
forensics, and allow for exactly the kind of data mining we're doing  
here.  I can't see /any/ reason for not saving the raw messages in  
their entirety, especially for a public list.  Maybe for a private  
one, where your data retention policies require you delete things  
after a certain amount of time, but even there, I can't see why you'd  
want to trim raw messages rather than just chucking them entirely.

> My favorite archival service is based on mhonarc,
> and raw mail goes into offline cold storage.

What's the advantage of that?  Isn't disk space cheap as dirt?
Probably cheaper if you've bought any topsoil recently :).  Even so,
the raw messages are still available, right?  So if there was enough
value in calculating the canonical urls so that the archive service
could be seen as an interoperability good citizen, then it could be
done.

I'll just reiterate that I'm not married to including the Date header  
in the algorithm.  Until proven otherwise by more research, I think  
it's a good idea to use because 1) it's required by RFC 2822 and 2)  
it seems to reduce collisions.  I think the algorithm I propose would  
work just as well with Message-IDs alone, although there's more of a  
chance that the non-sequence numbered url will return multiple matches.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdRVnEjvBPtnXfVAQJiOgP/UIufdisvgVPV3qKo4dV2bfWoUPcp/dIQ
iGj9faWXFwa/NoOk3HtIZbu7JVrJEY2t9nihJX6lEjZ1Q6AFH1hkObx0dV5NRfj2
KjRANxU6UsBvpDCzBQWthX1d7HviRJ74Pio5hVti+0YoV4pjq8UHaxTlrECHmkad
ERlOYR2onAQ=
=8b8I
-----END PGP SIGNATURE-----

From jfesler at gigo.com  Wed Jul 25 16:29:02 2007
From: jfesler at gigo.com (Jason Fesler)
Date: Wed, 25 Jul 2007 07:29:02 -0700 (PDT)
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707242147v6a04f50dkf35a11b67f7ac63f@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<FF2E18AA-8137-4509-9780-B4CCA77CAE82@python.org>
	<e03b90ae0707242147v6a04f50dkf35a11b67f7ac63f@mail.gmail.com>
Message-ID: <Pine.BSF.4.62.0707250720050.43115@vette.gigo.com>

> Guarantee is a pretty strong word. A malicious person could post two
> messages with the same message-id, same date, but different bodies.

This is my concern too, especially since this is known information; it is
trivial to be malicious.  Whatever is done, I think it would *have* to deal
with 'dupes', in some form or another.

From gustav at gcis.gov.za  Wed Jul 25 11:30:03 2007
From: gustav at gcis.gov.za (Gustav H Meyer)
Date: Wed, 25 Jul 2007 11:30:03 +0200
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <46A62C31.5010501@Newfield.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>	<849198AE-DEC3-44C8-A090-470720624185@python.org>	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<46A62C31.5010501@Newfield.org>
Message-ID: <46A7181B.60406@gcis.gov.za>

Hi,

I think this is the first time that I'm posting here but hopefully
not the last. Thanks to everyone involved for an incredible project.
I'm not much of a developer but I like practical solutions and will
do everything possible to help improve in this area even if it's
just to give some feedback.

I'm very excited about this project and can't wait for the next
version to come out with full integration between web forum and
mailing list. I like this idea very much and it seems that we're
going to see it real soon. :)

On 24/07/2007 18:43, Dale Newfield wrote:
> Jeff Breidenbach wrote:
>> In addition, Barry was talking about concocting a unique
>> identifier from the Date field and Message-ID. I'm not a big fan of
>> this idea, because the date field comes from the mail user agent
>> and is often wildly corrupt, e.g. coming from 100 years in the future.
> 
> Oh--I was assuming the Date to which he was referring was the current 
> timestamp at which mailman was processing the message.  I was going to 
> say that this guarantees uniqueness, but I guess there are parallel 
> mailman implementations where more than one machine/processor are all 
> serving the same list, and then two different machines/processors might 
> wind up with identical timestamps while processing two different messages.

I also like the idea of seeing the date somewhere in the URL but
IMHO we also need to see a unique sequential number. How about the
following idea:

http://my.list.server/archivebase/mylist/200707240001/msg00001/
http://my.list.server/archivebase/mylist/200707250001/msg00002/
http://my.list.server/archivebase/mylist/200707250002/msg00003/

and at the same time allow the following:
http://my.list.server/archivebase/mylist/msg00001/
http://my.list.server/archivebase/mylist/msg00002/
http://my.list.server/archivebase/mylist/msg00003/

This way you can see exactly how many messages were sent on a day
and how many messages have been sent since the start.

BTW, in my view the sequential number does not have to be a decimal
value. Anything short and sweet will do as long as you can work it
out and it allows for almost unlimited growth.
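
One possible reading of the URL scheme above, sketched in Python: the first path component is the date followed by a four-digit per-day counter, and the second is a list-wide sequence number. That interpretation, the field widths, and the name `archive_paths` are assumptions drawn from the three example URLs, not a specification.

```python
from datetime import date


def archive_paths(post_dates,
                  base="http://my.list.server/archivebase/mylist"):
    """Return (dated_url, short_url) pairs for posts in arrival order."""
    per_day = {}
    urls = []
    for seq, day in enumerate(post_dates, start=1):
        # Count how many posts have arrived on this calendar day so far.
        per_day[day] = per_day.get(day, 0) + 1
        dated = "%s/%s%04d/msg%05d/" % (
            base, day.strftime("%Y%m%d"), per_day[day], seq)
        short = "%s/msg%05d/" % (base, seq)
        urls.append((dated, short))
    return urls
```

Note that this still relies on a stable list-wide sequence number, so it inherits the regeneration problems with sequence numbers raised elsewhere in the thread.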

Just an idea.

Regards,
Gustav H Meyer

From stephen at xemacs.org  Wed Jul 25 18:40:08 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 26 Jul 2007 01:40:08 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <128BEC04-DFA5-4D9E-B813-8091FE3DEE94@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
	<e03b90ae0707241103qaba04b5hc72d2ee9d2d66f1@mail.gmail.com>
	<128BEC04-DFA5-4D9E-B813-8091FE3DEE94@python.org>
Message-ID: <874pjsl9nb.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > I agree, I just don't think message-ids are user friendly enough to  
 > be this canonical url.  Especially in this context, which is exactly  
 > where urls are thrown in users' faces.  An archiving service is  
 > exactly the right place for redirecting human readable urls to the  
 > archiver's canonical url (by, I agree, 302).

I'm confused (to be precise, you're confusing me).  If human readable
URLs are exactly right for redirection to the canonical URL, why does
the canonical URL need to be user friendly?

A quick remark: the git SCM uses BASE16 SHA1s for object names, but
allows you to abbreviate them to the unique prefix.  A friendly
archive could do the same for your BASE32 ids.
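
A rough Python sketch of that abbreviation idea, assuming the archive holds the full set of ids for a list in memory. The function name and the minimum length of four characters are illustrative choices, not something git or any archiver prescribes.

```python
def shortest_unique_prefixes(ids, min_len=4):
    """Map each id to its shortest prefix that no other id shares."""
    result = {}
    for ident in ids:
        others = [other for other in ids if other != ident]
        n = min_len
        # Grow the prefix until it no longer matches any other id.
        while n < len(ident) and any(o.startswith(ident[:n]) for o in others):
            n += 1
        result[ident] = ident[:n]
    return result
```

An abbreviation is only unique relative to the ids known at the time it is computed, so an archive using this would presumably still need to handle a prefix that later becomes ambiguous.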

Without going much into implementation, here's how I would write the
conformance section for our RFC.  The point is that I don't see any
need to discuss user-friendliness or the implementation of UUIDs for
the RFC!  This means that getting those right from the start is
not that important.

    0. Conformance

    0.1 List managers

    A conforming list manager MUST provide the List-Archive header
    field if the post is being archived.

    A conforming list manager MAY provide the List-Archive-UUID header
    field.  If so, the value MUST be guaranteed unique, and it MUST be
    present in the post as provided to the archiver.  The contents of
    this header need not be distinct from the contents of the
    Message-ID header, as long as the uniqueness guarantee is
    maintained.

    0.2 Archives

    A conforming archive MUST reserve the namespaces "message-id/" and
    "list-post-id/" relative to its base URL for the uses described
    below.

    A conforming archive MUST support retrieval by Message-ID, using
    the namespace "message-id/$(MESSAGE-ID)" relative to its base URL.
    The archive specified in the List-Archive header field MUST
    support access using the value of that field as its base URL.

    A conforming archive SHOULD support retrieval by UUID, using the
    namespace "list-post-id/$(LIST-ARCHIVE-UUID)" relative to its base
    URL.  If the scheme is "http" or "https", a conforming archive
    that does not support retrieval by UUID SHOULD return status 501
    NOT IMPLEMENTED with an entity explaining that retrieval by UUID
    is not implemented.

    A conforming archive MAY support "friendlyurls" for use where
    space is constrained (eg, in a post's footer).  A conforming
    archive may support any other URIs it wants to, too.<wink>  A
    third party SHOULD be able to regenerate a friendlyurl from the
    original message contents.

    0.3 Software

    Conforming archive software SHOULD provide interfaces for
    generating UUIDs and friendlyurls, if retrieval is supported.
    Conforming list managers SHOULD use these interfaces.

Some comments:

The interfaces for generating URLs should be provided as command line
utilities as well as callable functions.

Although the conformance level for friendlyurl support is "may", I
expect that essentially all archives will support friendlyurls.

The namespace for UUIDs and friendlyurls should probably be more
restricted than "any valid URI".

"List manager" denotes any source of archival content (eg, you could
imagine a user storing their outbox in an archive, so that the "list
manager" would actually be the user's MUA).  The namespaces suggested
above are good enough, I think, but there may be better ones.

Instead of 501 NOT IMPLEMENTED, I considered 410 GONE, but that
implies a request to delete the reference.  Since this is implemented
as a header in the post, the archive could be augmented to support it
later.

In the phrase "guaranteed unique", "guaranteed" means "to the level
provided by uuidgen or standard Message-ID generators".

Generation of friendlyurls or unique ids based on message body content
is probably a bad idea.
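
The two reserved namespaces from section 0.2 can be sketched as a small Python helper. The draft does not say how the identifier should be escaped or whether the angle brackets are kept, so stripping the brackets and percent-encoding everything are assumptions here, as is the name `retrieval_url`.

```python
from urllib.parse import quote


def retrieval_url(list_archive, namespace, identifier):
    """Build <base>/<namespace>/<identifier> from a List-Archive value."""
    # RFC 2369 header values are wrapped in angle brackets.
    base = list_archive.strip().strip("<>")
    if not base.endswith("/"):
        base += "/"
    # Assumption: drop the angle brackets and percent-encode the rest.
    ident = quote(identifier.strip().strip("<>"), safe="")
    return base + namespace + "/" + ident
```

For example, a client holding only the headers of a post could call `retrieval_url(msg["List-Archive"], "message-id", msg["Message-ID"])` and hand the result to a web browser.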


From stephen at xemacs.org  Wed Jul 25 18:56:45 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 26 Jul 2007 01:56:45 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <F0935B34-0264-430E-B59D-4B8ADEA31DB9@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
	<F0935B34-0264-430E-B59D-4B8ADEA31DB9@python.org>
Message-ID: <873azcl8vm.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > Yes, definitely.  What do you think of the base32 examples I have on  
 > the wiki page?

They're somewhat better than Message-IDs for readability, but they're
not user-friendly.

 > On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:
 > 
 > > It seems silly to generate nice short links but then use message-id.

The use case for the message-id is not people.  It's software, which
doesn't much care about "nice short".  But the developers debugging
and maintaining the software will thank us for the ease of verifying
that the URL goes to the right place.


From jeff at jab.org  Thu Jul 26 08:23:55 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 25 Jul 2007 23:23:55 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<877ioqys3k.fsf@athene.jamux.com>
	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>
	<FF2E18AA-8137-4509-9780-B4CCA77CAE82@python.org>
	<e03b90ae0707242147v6a04f50dkf35a11b67f7ac63f@mail.gmail.com>
	<2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org>
Message-ID: <e03b90ae0707252323r25ea9099oe3f93cb90b6fc1e3@mail.gmail.com>

> If you improve the script or find numbers that lead to different
> conclusions, now's the time to know!

Live and learn!

So I just looked at 2 million raw messages from 2007, spread over
a few thousand mailing lists (all data is from mail-archive.com). My
first question was - when comparing only with messages from the
same list - how many times do I see a repeated message-id? The
answer was ... drumroll please ... 260 thousand. What the hell?

Time for a closer look. In some cases, the archiver was getting two
copies of every message. For example, the MLM (mailman) was
sending out a message to subscriber A and subscriber B, and both
paths eventually lead to the archiver.

In another case, the MLM (YahooGroups) spammed 20 copies of the
same message to every subscriber, and modified the body of each one.
YahooGroups tends to create HTML mail and stick ads, possibly spyware,
and who knows what other crap in message footers.

There are probably other categories I haven't noticed yet; 260k messages
is a lot of checking. So you'd think the archives would be a complete
mess. But they aren't, and I had no idea anything was remotely amiss
under the hood. That's because mhonarc only archives one message
per message-id. So those 19 repeats from YahooGroups get thrown away.
This is actually a pretty robust strategy when you think about it; it keeps
lots of annoyances out of archives and everyone who gets smited
deserves it: accidental duplicates, malicious duplicates, broken mail
transfer agents. Reasonable people can disagree, but I like it.
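
The one-message-per-message-id strategy described above amounts to something like the following Python sketch. It is an illustration of the behaviour as described in this message, not mhonarc's actual code; keeping every message that lacks a Message-ID is an assumption, and the function name is hypothetical.

```python
def keep_first_per_message_id(messages):
    """Return messages in order, dropping repeats of a seen Message-ID."""
    seen = set()
    kept = []
    for msg in messages:
        msgid = str(msg.get("Message-ID") or "").strip()
        if msgid and msgid in seen:
            continue  # a repeat of an already archived id: discard it
        if msgid:
            seen.add(msgid)
        # Assumption: messages with no Message-ID are always kept.
        kept.append(msg)
    return kept
```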

So I'm amending my request. If mailman and pipermail++ want to
keep a verbatim record of everything passing through the MLM, fine.
But please make it also possible to interoperate with archivers that
use the looser mhonarc strategy, i.e. allow the interoperability URL
to collide when message-ids collide. Currently Stephen's proposal
allows this, Barry's does not.

Just to make things really concrete, here's an example from that
YahooGroups collision I was describing. The 20 messages spammed to
subscribers would all have an interoperability URL something like this
(but perhaps not quite so enormously long) embedded in the
message, in the headers and possibly a footer.

http://www.mail-archive.com/search?l=estika%40yahoogroups.com&q=3578.125.161.129.196.1175036508.CBNWebMail%40webmail1.cbn.net.id

Clicking on it, the user goes to the archive server. For this particular
archiver, an HTTP 302 redirect takes the user to another URL which
happens to be more human friendly. But the details of what alternate
URLs are available - if any - are really up to the archive server.

http://www.mail-archive.com/estika at yahoogroups.com/msg01341.html

I think that's about it. I do kind of like Stephen's suggestion of
allowing the archiver to supply a formula for the interoperability URL;
if that's the case I'd say the RFC2369 headers could be fair game
for use in the calculation. That allows cross posted messages to
easily link to their correct archive - note how I used the contents of
List-Post when creating the interoperability URL above.
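
As a hedged illustration of such a formula, the example URL above can be reproduced in Python from the List-Post address and the Message-ID. The meaning of the `l` and `q` parameters is inferred from that one example rather than from any documentation, and `interoperability_url` is a hypothetical name.

```python
from urllib.parse import urlencode


def interoperability_url(list_post, message_id,
                         base="http://www.mail-archive.com/search"):
    """Combine a List-Post address and a Message-ID into a lookup URL."""
    # List-Post usually looks like <mailto:list@example.com>.
    address = list_post.strip().strip("<>")
    if address.lower().startswith("mailto:"):
        address = address[len("mailto:"):]
    msgid = message_id.strip().strip("<>")
    # urlencode percent-escapes "@" as %40, as in the example URL.
    return base + "?" + urlencode({"l": address, "q": msgid})
```

Because the list address is part of the query, a cross-posted message carrying different List-Post values in each copy would produce a different URL per list.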

Jeff

From Dale at Newfield.org  Thu Jul 26 09:37:37 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Thu, 26 Jul 2007 03:37:37 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <e03b90ae0707252323r25ea9099oe3f93cb90b6fc1e3@mail.gmail.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>	<877ioqys3k.fsf@athene.jamux.com>	<e03b90ae0707240931n33e3d595raf953e3ddb724980@mail.gmail.com>	<FF2E18AA-8137-4509-9780-B4CCA77CAE82@python.org>	<e03b90ae0707242147v6a04f50dkf35a11b67f7ac63f@mail.gmail.com>	<2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org>
	<e03b90ae0707252323r25ea9099oe3f93cb90b6fc1e3@mail.gmail.com>
Message-ID: <46A84F41.4020003@Newfield.org>

Jeff Breidenbach wrote:
> So I just looked at 2 million raw messages from 2007, spread over
> a few thousand mailing lists (all data is from mail-archive.com). My
> first question was - when comparing only with messages from the
> same list - how many times do I see a repeated message-id? The
> answer was ... drumroll please ... 260 thousand. What the hell?

I think the question you were originally going to ask got sidetracked. 
If we assume that all these "multiple paths from list to archive" 
duplicates not only share a Message-ID but also a Date (they were the 
same message originally, so they should!), then both schemes (messageid, 
and messageid+date) would decide that all (but one of) these messages 
are redundant.

What we really want to know is how many (non-empty) Message-ID 
collisions are there that *don't* share a Date?  This is the number of 
messages that only-messageid loses, and that the composite identifier 
method would not lose.
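
Stated as code, the number in question could be computed along these lines. This is a sketch that assumes the headers have already been extracted into (message_id, date) pairs of strings; the function name is hypothetical and Date values are compared as raw strings without normalisation.

```python
from collections import defaultdict


def lost_by_message_id_only(pairs):
    """Count messages a Message-ID-only key drops but ID+Date keeps."""
    dates_by_id = defaultdict(set)
    for message_id, date in pairs:
        if message_id:  # ignore empty or missing Message-IDs
            dates_by_id[message_id].add(date)
    # With N distinct Dates under one id, the composite key keeps N
    # messages while the id-only key keeps 1, so N - 1 are lost.
    return sum(len(dates) - 1 for dates in dates_by_id.values())
```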

-Dale

From jeff at jab.org  Fri Jul 27 06:56:44 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Thu, 26 Jul 2007 21:56:44 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <200707270320.l6R3KXCJ028654@gator.earlhood.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
	<849198AE-DEC3-44C8-A090-470720624185@python.org>
	<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
	<7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
	<e03b90ae0707232302q45d54768l97bc98dcbeb76c7e@mail.gmail.com>
	<874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
	<419AFBBF-82FF-4939-B85B-85055A1B8482@python.org>
	<200707270320.l6R3KXCJ028654@gator.earlhood.com>
Message-ID: <e03b90ae0707262156i5c16d629g652bfb6624d3ee39@mail.gmail.com>

> If you are relying on the sender to do the right thing, then
> why not force them to create proper message-ids?

I think Barry's proposal is essentially a numbers game - i.e.
he's hoping for significantly better results using "Date" in
the calculation than not using it.

http://wiki.list.org/display/DEV/Stable+URLs

I'll try to tease out some more useful stats from some large
datasets this weekend. (I can't just run the python scripts as is
because I don't have python 2.5 in the same place as the data,
I don't keep raw messages in mbox format, blah blah blah, but
we'll figure it out).

My hypothesis is "Date" doesn't really buy much, but that's
in part because I have a vested interest in that outcome.
We'll see how the data plays out. And I still think RFC2369
headers are needed in the calculation if cross posted
messages are to be handled correctly.

Jeff