From terri at zone12.com Tue Jul 3 05:06:31 2007 From: terri at zone12.com (Terri Oda) Date: Mon, 2 Jul 2007 23:06:31 -0400 Subject: [Mailman-Developers] Improving the archives Message-ID: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> Since I've largely finished up the coding contract that was eating up a lot of my time, I'm thinking that I'd like to do some coding for fun. And nothing says fun like trying to fix the Mailman archives! ;) I'm trying to remember all the things people have suggested for the archives in the past so I can figure out what needs to be done and what might be nice to have, and see if this is doable in the time I have in the foreseeable future. The big things people wanted most, if I recall correctly, included: - modernized HTML/CSS/Themes (preferably to match a modernized web interface... is that all set up now?) - archive links that won't break if the archive is rebuilt - better address obfuscation (maybe by generating pages through cgi) - search - not adding a billion dependencies to Mailman Here's the list from the wiki's Mailman 2.2 page: http:// wiki.list.org/display/DEV/Mailman+2.2 * Reconsider using a 3rd-party archiver * Perhaps URLs to messages should be based on message-ids instead of message numbers so that regenerating archives can't break links. This must include backward compatible links * Ditch direct access and vend all archive messages through CGI so that we can do address obfuscation, and message deletion, etc. on the fly (with caching of course, but have to worry about web crawlers). * Add RSS feed * Allow for admins to remove or edit messages through the web. * Move archive threads into another list? * Put archives in the list/mylist directory. 
* Add a search option
* Make archives default template look and feel similar to Web UI (whatever it looks like after the Summer of Code project is done)
* Make archive templatable (at least by changing CSS) so they can match people's existing site look-and-feel
* MUAs usually make URLs clickable. A new Archive could be used when posts are distributed, in the footer, so that each message has a link to the whole thread in the Archive.
* Present all messages in a thread at once, and offer plaintext download of the whole thread
* Put messages into a database and/or move away from mbox as the canonical storage format.

So the questions are:

(1) Is anyone working on this already?
(2) What else is on people's wish lists for a pipermail replacement?

Terri

From huston at astro.princeton.edu Tue Jul 3 13:36:23 2007
From: huston at astro.princeton.edu (Steve Huston)
Date: Tue, 03 Jul 2007 07:36:23 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
Message-ID: <468A34B7.3080201@astro.princeton.edu>

I'll admit to not having read previous discussions on this topic, but I'll also add my 2 cents here:

On 7/2/07 11:06 PM, Terri Oda wrote:
> - better address obfuscation (maybe by generating pages through cgi)

I run a few WordPress sites, and there's a plugin I use called PHPEnkoder which does a good job of this. It basically wraps the address in a little bit of JavaScript; if you have JavaScript turned on in the browser, it's seamless, and if not you see "Javascript required to view address" or something like that. The theory is that bots and such don't run JS, so it's "safe" from harvesting.
I'll leave it to the list as to how true an assessment this is, but it Works For Me :>

> * Add a search option

I know there have been patches around forever that integrate ht://Dig with Pipermail; maybe some way to do this, while still making it an option that can be tuned? If ht://Dig is there and you turn on the option, it works, but if it's not then it's not required? This would satisfy the "not adding a billion dependencies", but may be overkill as well. I'll also happily admit to not knowing much about the cost of search engines to a system.

> * MUAs usually make URLs clickable. A new Archive could be used
> when posts are distributed, in the footer, so that each message has a
> link to the whole thread in the Archive.

This would be a Godsend. A group at work here runs an old homebrewed exploder, and a few years ago I tried to convert them to Mailman. They liked everything they saw, up until the point where they couldn't refer to some kind of short and simple message number, and get right to that message in the archive. The current system generates a number based on a simple incrementing index of the list, and many months after a mailing people will refer to "message #483", and know they can view it at http://hostname/foo/listname/483.html - which is also posted in the footer of the message sent out. Of course, if the archives were based on Message-ID headers, this may make such a number a bit unwieldy, but if it were some kind of simple-ish system I might finally get rid of those old lists :>

--
Steve Huston - W2SRH - Unix Sysadmin, Dept. of Astrophysical Sciences
  Princeton University  |    ICBM Address: 40.346525   -74.651285
    126 Peyton Hall     |"On my ship, the Rocinante, wheeling through
  Princeton, NJ   08544 | the galaxies; headed for the heart of Cygnus,
    (609) 258-7375      | headlong into mystery."
-Rush, 'Cygnus X-1' From barry at python.org Wed Jul 4 02:05:12 2007 From: barry at python.org (Barry Warsaw) Date: Tue, 3 Jul 2007 20:05:12 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> Message-ID: <849198AE-DEC3-44C8-A090-470720624185@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 2, 2007, at 11:06 PM, Terri Oda wrote: > Since I've largely finished up the coding contract that was eating up > a lot of my time, I'm thinking that I'd like to do some coding for > fun. And nothing says fun like trying to fix the Mailman archives! ;) That would be awesome Terri! It's an aspect of Mailman that sorely needs attention, and you will gain (even more) fame and fortune by working on it. :) I totally support this effort. > I'm trying to remember all the things people have suggested for the > archives in the past so I can figure out what needs to be done and > what might be nice to have, and see if this is doable in the time I > have in the foreseeable future. > > The big things people wanted most, if I recall correctly, included: > > - modernized HTML/CSS/Themes (preferably to match a modernized web > interface... is that all set up now?) It's not, but Andrew Kuchling will be working on this. I haven't yet revealed detailed plans, though I'm working on an email about this over the U.S. July 4th holiday. But I suppose it's time for a quick summary: I'd like to get a Mailman 2.2 out with an updated u/i sooner rather than later, and if possible an updated archiver would be one of those few other new features that I think could go into a 2.2. OTOH, it would be fine if we pushed that off to Mailman 3 too, but it leveraged all the u/i work to be done in 2.2. 
> - archive links that won't break if the archive is rebuilt Yes, this is absolutely critical, in fact, I'd put it right at the top of the list, even more so than a u/i overhaul. Stable urls, with backward compatible redirecting links if at all possible, would be fantastic. Along with that, I would really like to come up with an algorithm for calculating those urls without talking to the archiver. This would allow the list delivery queue to calculate the List-Archive: header value and any message header/footer substitutions before the message hits the archiver. > - better address obfuscation (maybe by generating pages through cgi) I'd still love to do this, and I think were it not for crawlers, we could get a lot of mileage out of creation on demand and caching. But how do you handle Google crawling your archive? > - search Another huge huge feature. > - not adding a billion dependencies to Mailman Definitely. I'm also not opposed to changing the interface between Mailman and the archivers if necessary. > Here's the list from the wiki's Mailman 2.2 page: http:// > wiki.list.org/display/DEV/Mailman+2.2 We should probably start a separate archiver wiki page. I plan on re- organizing the 2.2 page anyway, so I'll probably end up doing that if you don't get around to it before me . > (1) Is anyone working on this already? Not that I know of. > (2) What else is on people's wish lists for a pipermail replacement? Other things high on my list are ditching the crufty storage currently being used (pickles begone!), an RSS feed, and a 'message storage' which could be used to vend archived messages through other delivery transports, such as imap or nntp. But I'd be willing to put all that off for stable urls, an updated u/i, and searching. Anything I can do to help, please let me know. 
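[Editorial sketch, not part of the original message.] The wish above, computing an archive URL without talking to the archiver, could be met by deriving the URL purely from the Message-ID header. The following is a minimal illustration under stated assumptions: the function names, the choice of SHA-1 with base32 encoding, and the URL layout are all hypothetical and are not something the thread agreed on.

```python
import base64
import hashlib


def message_id_hash(message_id: str) -> str:
    """Return a URL-safe digest of a Message-ID header value (hypothetical helper)."""
    # Strip whitespace and surrounding angle brackets so that
    # "<a@b>" and "a@b" produce the same digest.
    cleaned = message_id.strip()
    if cleaned.startswith("<") and cleaned.endswith(">"):
        cleaned = cleaned[1:-1]
    digest = hashlib.sha1(cleaned.encode("utf-8")).digest()
    # Base32 uses only A-Z and 2-7, so the result is safe in a URL path.
    # A 20-byte SHA-1 digest encodes to exactly 32 characters, no padding.
    return base64.b32encode(digest).decode("ascii")


def archive_url(base_url: str, list_name: str, message_id: str) -> str:
    """Build a permalink from data the delivery queue already has (assumed layout)."""
    return f"{base_url.rstrip('/')}/{list_name}/{message_id_hash(message_id)}"
```

Because both the delivery queue and the archiver can run the same function over the same header, either side can produce the link for a List-Archive: header or a footer substitution independently. Duplicate or missing Message-IDs are not handled here.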
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRorkOHEjvBPtnXfVAQLw0wP/TFgXxFAcK+3QiDG4jkyPCVVpP0EqATwB nYfUDrf0ytuTphFMM4gJmWbZdtR1HJ2xqNOit18QTsM/pjTiIDB++nH0IoRkRwy3 qs4JdBb+m3Amuxaaa4dQp+nWQt2yUMsF/HWp3BS/vx8oCfkjMhOKDI29/UG9jU+L L64QzWeywGw= =ewlo -----END PGP SIGNATURE----- From barry at python.org Wed Jul 4 02:13:56 2007 From: barry at python.org (Barry Warsaw) Date: Tue, 3 Jul 2007 20:13:56 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <468A34B7.3080201@astro.princeton.edu> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <468A34B7.3080201@astro.princeton.edu> Message-ID: <13B9C232-8295-4533-B49B-205B901AA8E7@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Steve makes me think of a couple of other wish list items. On Jul 3, 2007, at 7:36 AM, Steve Huston wrote: > On 7/2/07 11:06 PM, Terri Oda wrote: >> - better address obfuscation (maybe by generating pages through cgi) > > I run a few Wordpress sites, and there's a plugin I use called > PHPEnkoder which does a good job of this. I have this idea that you could gateway messages from an archive or mailing list to and from a bulletin board forum. Maybe this doesn't fall within the scope of the archiver because I could see a 'forum queue' like we have an nntp queue, but in that case, being able to calculate an archive url without talking to the archiver becomes important again. It would be nice in that case to put a link to the archive message in the forum post. >> * MUAs usually make URLs clickable. An new Archive could be used >> when posts are distributed, in the footer, so that each message has a >> link to the whole thread in the Archive. > > This would be a Godsend. A group at work here runs an old homebrewed > exploder, and a few years ago I tried to convert them to Mailman. 
> They > liked everything they saw, up until the point where they couldn't > refer > to some kind of short and simple message number, and get right to that > message in the archive. This reminds me, I would love to have a link in an archive message that I could click to get the message sent to me, as it originally appeared on the mailing list. If I had that, I'd never need to locally save another mailing list post. I'd just search for the one I wanted, go to the archive, click on the "send it to me" link, then do a normal reply in my mail reader. > The current system generates a number based on > a simple incrementing index of the list, and many months after a > mailing > people will refer to "message #483", and know they can view it at > http://hostname/foo/listname/483.html - which is also posted in the > footer of the message sent out. Of course, if the archives were based > on Message-ID headers, this may make such a number a bit unwieldly, > but > if it were some kind of simple-ish system I might finally get rid of > those old lists :> This would be possible with today's system, but it leads to unstable urls, especially when you consider archive scrubbing (which, come to think of it, is another wish list item ;). We'd like for an admin to be able to easily pull an archive message, but it's even worse than that. Sometimes an admin has to scrub the actual backing message store (e.g. today's mbox file). This will change the message counts and thus the incremental indexes. Maybe a way to think about this is that the canonical url is based on the message-id, but then there's some way to distill even this down to a tinyurl or simple integer that would be stable in the face of full archive regenerations. 
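[Editorial sketch, not part of the original message.] One way to read the idea of distilling a message-id down to a simple integer that survives archive regeneration is to keep the number assignment in its own persistent store, separate from the mbox, so that scrubbing or rebuilding never renumbers anything. The class name, the JSON file, and the numbering scheme below are assumptions made for illustration only.

```python
import json
from pathlib import Path


class ShortIdRegistry:
    """Persistent Message-ID -> small integer map, kept outside the mbox.

    Hypothetical helper: entries are only ever added, never removed or
    renumbered, so a number stays valid even if the archive is rebuilt
    or a message is later scrubbed from the backing store.
    """

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.ids = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.ids = {}

    def short_id(self, message_id: str) -> int:
        """Return the existing number for message_id, or assign the next one."""
        if message_id not in self.ids:
            # Safe only because entries are never deleted from the map.
            self.ids[message_id] = len(self.ids) + 1
            self.path.write_text(json.dumps(self.ids), encoding="utf-8")
        return self.ids[message_id]
```

With something like this, "message #483" could keep pointing at the same post while the canonical URL stays message-id based. Concurrent writers and locking are deliberately left out of the sketch.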
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRormRHEjvBPtnXfVAQIHYwP/fLnY/pebRlhrFeUpPJu5VfZNyR24oLId
qjZ4F2MHW25LcemvGzpeUSgXRQJk2LQIQKSlYYtTM+8xcStey4IvDnPLmzX5MQOC
xiI9PznZHdLmbF9SaUDZQZBRKZhqCNeslZ5zpnN35KStL3NlTc6PkBylzIC7Y47F
a3RxMEOgMaA=
=HM9I
-----END PGP SIGNATURE-----

From stephen at xemacs.org Wed Jul 4 09:49:58 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Wed, 04 Jul 2007 16:49:58 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <849198AE-DEC3-44C8-A090-470720624185@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <849198AE-DEC3-44C8-A090-470720624185@python.org>
Message-ID: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

 > > - archive links that won't break if the archive is rebuilt
 >
 > Yes, this is absolutely critical, in fact, I'd put it right at the
 > top of the list, even more so than a u/i overhaul. Stable urls, with
 > backward compatible redirecting links if at all possible, would be
 > fantastic.

+1. I've been wanting to do something about this, and have made proposals (not backed with code, mea maxima culpa) for design. I would definitely be happy to help with this, but given time constraints, it would be nice if somebody else could take the lead.

 > Along with that, I would really like to come up with an algorithm for
 > calculating those urls without talking to the archiver.

Brad didn't like this when I suggested it before, but I didn't really understand why not. Anyway, FWIW:

I suggest adding an X-List-Received-ID header to all messages. I haven't really thought through whether the UUID in that field should be at least partly human-readable or not, but that doesn't matter for the basic idea.[1]

The on-disk directory format would be

    /path-to-archive/private/my-list/Message-ID

for singletons (Message-ID is the author-supplied ID) and

    /path-to-archive/private/my-list/Message-ID/List-Received-ID

for multiples.
These would be created on-the-fly when they occur. They can be served as static pages. For almost all messages, the bare URL http://archives.example.com/my-list/Message-ID should Just Work (ie, return a no-such-object result or a single message). Where it does not, you get an index of all pages with that message ID. The main drawback to using Message IDs that I can see is that broken MUAs may supply no Message-ID, or the same one repeatedly. In the former case, as a last resort Mailman can supply one, but that won't help people who get a personal copy and want to find the thread. However, I see no way to help them, anyway, beyond a generic archive search engine. In the latter, you get lots of messages matching the Message-ID, and while most lists should have *zero* problems, a list that has any instances of this problem would have many. Again I can't see a good way to deal with this other than a general search facility, as computing a digest of headers or content is hard to do reliably. Providing an index of matching posts seems like a reasonable approach, which can be efficiently implemented (eg, as static pages). Furthermore, the examples I've seen of both in the last few years have all been either spam or (in the case of duplicate Message-IDs) actual duplicates due to some mail system problem or itchy user fingers. A minor drawback to my proposal is that if a message gets archived as a singleton for that Message-ID, then a duplicate arrives, previously created references in the archive will of course now return an index rather than the desired message. Ie, there is data corruption. This can be dealt with in several ways; the easiest would be to provide a "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me" link when creating the directory for multiple instances. There's also a *very* minor benefit: repeat sends will be immediately recognizable without checking Message-ID. 
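[Editorial sketch, not part of the original message.] The lookup rule described above (a bare Message-ID URL returns a single message when there is exactly one copy, an index when there are several, and a no-such-object result otherwise) can be illustrated with a small in-memory model. The class, method names, and return values are hypothetical; the proposal itself speaks of static pages on disk, which this sketch does not implement.

```python
from collections import defaultdict


class ArchiveIndex:
    """In-memory stand-in for the proposed Message-ID directory layout."""

    def __init__(self):
        # Message-ID -> list of List-Received-IDs, in arrival order.
        self._by_message_id = defaultdict(list)

    def add(self, message_id: str, received_id: str) -> None:
        """Record one archived copy of a message."""
        self._by_message_id[message_id].append(received_id)

    def resolve(self, message_id: str):
        """Return (kind, paths) for a bare Message-ID lookup."""
        # .get() avoids creating an empty entry for unknown IDs.
        received = self._by_message_id.get(message_id, [])
        if not received:
            return ("not-found", [])
        if len(received) == 1:
            # Singleton: the bare Message-ID path is the message itself.
            return ("message", [message_id])
        # Multiples: the bare path becomes an index of every copy.
        return ("index", [f"{message_id}/{rid}" for rid in received])
```

The sketch also shows the drawback noted in the message: once a second copy arrives, a path that used to resolve to a message starts resolving to an index.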
Footnotes: [1] By partly human-readable I mean containing list-id and date information. The idea would be to have the date come first, so that users would have a shot at identifying which of several messages is most likely, and this would be searchable by eye with simply an ordinary sorted index. From jam at jamux.com Wed Jul 4 18:58:20 2007 From: jam at jamux.com (John A. Martin) Date: Wed, 04 Jul 2007 12:58:20 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Wed, 04 Jul 2007 16:49:58 +0900") References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <87k5tgxg0j.fsf@athene.jamux.com> A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 154 bytes Desc: not available Url : http://mail.python.org/pipermail/mailman-developers/attachments/20070704/b9919498/attachment.pgp From Dale at Newfield.org Wed Jul 4 19:16:58 2007 From: Dale at Newfield.org (Dale Newfield) Date: Wed, 04 Jul 2007 13:16:58 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <13B9C232-8295-4533-B49B-205B901AA8E7@python.org> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <468A34B7.3080201@astro.princeton.edu> <13B9C232-8295-4533-B49B-205B901AA8E7@python.org> Message-ID: <468BD60A.2020709@Newfield.org> I'm all for someone taking ownership of this long-neglected component -- thank you for doing so! Barry Warsaw wrote: > Maybe a way to think about this is that the canonical url is based on > the message-id, but then there's some way to distill even this down > to a tinyurl or simple integer that would be stable in the face of > full archive regenerations. The resistance to basing this on message-id has always been that there's no guarantee of uniqueness... 
...but I believe each list has some sort of counter for how many messages it's seen, so we could add another header with that number, and use as a unique id the two concatenated together...

(That way the archiver can know from the content of the header exactly how to generate the same unique id as mailman, which would allow for the url-in-the-footer to happen w/o first hitting the archiver.)

Just throwing out ideas,
-Dale

From jeff at jab.org Wed Jul 4 21:30:04 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 4 Jul 2007 12:30:04 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <468A34B7.3080201@astro.princeton.edu>
 <13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
Message-ID:

>Maybe a way to think about this is that the canonical url is based on
>the message-id, but then there's some way to distill even this down
>to a tinyurl or simple integer that would be stable in the face of
>full archive regenerations.

I'd suggest the reverse. Keep the canonical archive URL short and sweet, and then use a URL redirection service to map message-ids to those URLs. It is the archiver's job to make it all work. For example, the canonical archive URL might stay exactly the way it is in pipermail. But the archival link embedded in the message would instead go to a redirection service.

http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
http://mail.codeit.com/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

The one other thing I'd like to revisit is integration with third party archival services. There are two obvious integration points; one is a button in the Mailman list admin user interface that says "archive with service X", not unlike the setting in Firefox that basically says "search with service X". The other integration point is the archival link discussed above, in which case it would be set to something like:
http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

Disclosure: I help run a third party archiving service, and this topic was discussed quite a bit previously. [1] Nonetheless it seems like a good time to revisit it given the current discussion about archive wishlists.

[1] http://www.mail-archive.com/mailman-developers at python.org/msg08772.html

From jeff at jab.org Thu Jul 5 06:48:30 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 4 Jul 2007 21:48:30 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To:
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <468A34B7.3080201@astro.princeton.edu>
 <13B9C232-8295-4533-B49B-205B901AA8E7@python.org>
Message-ID:

>In which case [the message body link] would be set to something like.
>
>http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

Just for fun, I did a trial implementation. It works, but the URLs are too long. For example, the URL below spends 59 characters on the message-id, and 27 characters on the listname. We're already over my comfort level (of about 72 characters) and haven't even started to count the hostname and other URL-lengthening overhead. Maybe this was a bad idea after all.

http://www.mail-archive.com/search?l=mailman-developers%40python.org&q=e03b90ae0707041230m47110705t89cdbe3d2e4802cd at mail.gmail.com

Jeff

From jdennis at redhat.com Thu Jul 5 18:09:29 2007
From: jdennis at redhat.com (John Dennis)
Date: Thu, 05 Jul 2007 12:09:29 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <849198AE-DEC3-44C8-A090-470720624185@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <849198AE-DEC3-44C8-A090-470720624185@python.org>
Message-ID: <1183651769.10813.6.camel@finch.boston.redhat.com>

On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:
>
> > Since I've largely finished up the coding contract that was eating up
> > a lot of my time, I'm thinking that I'd like to do some coding for
> > fun. And nothing says fun like trying to fix the Mailman archives! ;)
>
> That would be awesome Terri! It's an aspect of Mailman that sorely
> needs attention, and you will gain (even more) fame and fortune by
> working on it. :) I totally support this effort.

A little over a year ago I went on a search to find the best open source archiver, and at that time I came up with Lurker (http://lurker.sourceforge.net). Since then I believe Lurker has seen a major new revision. I also believe Lurker is the archiver used by Debian.

So if you want to leverage existing open source archiving, or at least look at an example of what would be necessary to allow easy external archiving integration with Mailman, you might want to look at Lurker.
-- John Dennis From terri at zone12.com Thu Jul 5 19:02:37 2007 From: terri at zone12.com (Terri Oda) Date: Thu, 5 Jul 2007 13:02:37 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <1183651769.10813.6.camel@finch.boston.redhat.com> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com> Message-ID: On 5-Jul-07, at 12:09 PM, John Dennis wrote: > A little over a year ago I went on a search to find the best open > source > archiver and at that time I came up with Lurker > (http://lurker.sourceforge.net) Since then I believe Lurker has seen a > major new revision. I also believe Lurker is the archiver used by > Debian. I was hoping someone would post that link! Lurker was best of breed last time I was looking, and I'd definitely like to see what we can leverage there. Terri From barry at python.org Sat Jul 7 18:35:30 2007 From: barry at python.org (Barry Warsaw) Date: Sat, 7 Jul 2007 12:35:30 -0400 Subject: [Mailman-Developers] Mailman roadmap Message-ID: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Now that we've successfully navigated the switch to Bazaar, it's time to lay out plans for future Mailman releases. I've talked to several people about what to do about Mailman's future and I'd like to take this opportunity to describe my thoughts and get your feedback. First some background. Mailman 2.1 is (shockingly) four and a half years old, having been initially released on 30-Dec-2002. The last release in the series, 2.1.9 was made almost a year ago. In the meantime, Mark and Tokio have been doing a great job maintaining the 2.1 branch, with several important patches in the tree now that will eventually become 2.1.10. The problem of course is that we can't add any new features to the 2.1 family , so we should be thinking about a new major release. 
I've been making good progress on the SQLAlchemy/Elixir version, which will finally get rid of pickles and put Mailman on a Real Database (tm). It's been clear to me for a while that this branch will have a unified user database. It simply makes no sense to build the database back-end without once and for all fixing this design constraint. I've always said that the unified user database will be in Mailman 3, and thus this branch is indeed called "Mailman 3.0".

I've been slowly building things back up from the ground floor. The basic data model is in pretty good shape and I'm taking a religious test-driven approach to making things work again. But the branch still needs a lot of work, and I have no ETA for Mailman 3.0.

In the meantime, Andrew Kuchling and others have volunteered to work on modernizing the Mailman web u/i, and Terri recently started a thread discussing updates to the archiver. I think it makes sense to bless these efforts, towards the goal of releasing them in Mailman 2.2. I intend to create an official Mailman 2.2 branch in bzr where these efforts can land as they mature. My hope of course is that we'll also be able to use much of this new code for Mailman 3.

I'd like to keep the changes for 2.2 focused on the web u/i and archiver, with a small number of additional features to be determined. Mailman 2.2 should see no changes to the basic architecture or 'database'; we'll continue to use pickles by default for Mailman 2.2.

While I won't rule out other new features, I want to be very picky about those that are accepted for 2.2, and would not feel bad at all if we rejected or deferred until 3.0 most of those proposed. Criteria for other 2.2 features must include minimal code impact with a high degree of reliability and stability.

I plan on updating the wiki pages to reflect this thinking, but I would like to get feedback from y'all about the plan. It would be awesome if we could see a release of Mailman 2.2 some time in late 2007 or early 2008.
Comments, question?

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRo/A0nEjvBPtnXfVAQJ+dwP7BXXLM749qO6CXWQKZw41pFN42jYfN6Kg
LjNAQ9IejAT/TISGrSgk8UyZ9kP6ajnOFvKIfJNTFJdytJg8/lvDQSeW1N0u7sR+
Wp0N1e0qA4qfqLYsqRR9W1MQhecdBO/yEJo8KDsOQdGnpfINSKZ40FUvPEbC40U7
C/T83gS+Vxs=
=JJZS
-----END PGP SIGNATURE-----

From barry at python.org Sat Jul 7 18:36:54 2007
From: barry at python.org (Barry Warsaw)
Date: Sat, 7 Jul 2007 12:36:54 -0400
Subject: [Mailman-Developers] Mailman roadmap
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sorry, I forgot to cross-post this to mailman-users, so I'm reposting.
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRo/BJnEjvBPtnXfVAQL7iwP/TfPPvMsTnrrSxQAlvPjQoR27ySqUYh+P yZCvGxxp9DgNoFQOWl0mo1QzZ9ozXtiFfIHx4CJLybOis+yuiq+BWtih2MJnGBf7 SzD8qsBOu6N4sE8sn4n0tdmXr1fnh4qnrgTobvBX+3toJtHNGQTEVEZCxiWb5fKq JsUKDVVvOhQ= =CVNK -----END PGP SIGNATURE----- From barry at python.org Sat Jul 7 22:19:50 2007 From: barry at python.org (Barry Warsaw) Date: Sat, 7 Jul 2007 16:19:50 -0400 Subject: [Mailman-Developers] Mailman roadmap In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 7, 2007, at 12:35 PM, Barry Warsaw wrote: > I intend to create an official Mailman 2.2 branch in bzr where > these efforts can land as they mature. This branch is now live. http://wiki.list.org/display/DEV/MailmanBranches Cheers, - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRo/1Z3EjvBPtnXfVAQIwNAP+MgJj026MHEGwdXMoma9uNkTp8UzeLtCC mKh7OkcmZPMzKrNdlztQ5OmU1N1SWf9medErjM7QcKeIR0y+9aUjC65j8mamwWPa +XrbzdlWZoDxnO5qFh02rVNFATKRH00+ITiB6LvTEKJVxp9r+WL1sKq0FEElu9/W zkl80deXVvQ= =V7ce -----END PGP SIGNATURE----- From pabs at debian.org Sun Jul 8 07:06:02 2007 From: pabs at debian.org (Paul Wise) Date: Sun, 8 Jul 2007 15:06:02 +1000 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> Message-ID: On 7/3/07, Terri Oda wrote: > I'm trying to remember all the things people have suggested for the > archives in the past so I can figure out what needs to be done and > what might be nice to have, and see if this is doable in the time I > have in the foreseeable future. 
At lists.indymedia.org, we use a patch that provides these:

 * stable URLs based on a generated message id
 * URLs to the archived message in the message headers
 * message hiding

http://lists.indymedia.org/patches/imc-10-mmid_hide_posts.patch

It poses a bit of a migration issue since all the existing mboxes may or may not have the mmid header in them. We worked around that by having a special place for the old archives.

We've been meaning to move to lurker for years, but haven't had the human resources and also there were some showstoppers:

 * public/private lists - lurker couldn't do that properly when we looked
 * lack of date-based index to the archives
 * general navigation issues; stuff like linking between current thread and nearby ones
 * mailto links (has now been fixed)
 * the migration nightmare

My personal opinion is that pipermail should be removed and mailman should not contain a default archiver since there are plenty of good archivers already (lurker, mhonarc etc). Adding wrappers around them would be simpler than reimplementing them.

--
bye,
pabs

https://docs.indymedia.org/view/Main/PaulWise

From iane at sussex.ac.uk Mon Jul 9 15:35:33 2007
From: iane at sussex.ac.uk (Ian Eiloart)
Date: Mon, 09 Jul 2007 14:35:33 +0100
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
Message-ID:

--On 7 July 2007 12:35:30 -0400 Barry Warsaw wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Now that we've successfully navigated the switch to Bazaar, it's time
> to lay out plans for future Mailman releases.
....
> I plan on updating the wiki pages to reflect this thinking, but I
> would like to get feedback from y'all about the plan. It would be
> awesome if we could see a release of Mailman 2.2 some time in late
> 2007 or early 2008.
>
> Comments, question?

All sounds very good.
My two main problems with driving Mailman uptake here on campus are both to do with usability. The first is the current web interface, which was great when it was developed, but expectations have moved on. The second is the lack of a unified user database. So, it's great to see these items listed as the main focus for 2.2 and 3.0 respectively.

WRT 2.2, I'd like to be able to offer something as simple to use as the list management features of Google Groups (which I use for some voluntary groups that I work with), but with the ability to expose additional functionality on request.

WRT 3.0, for enterprise and education purposes, it's important to be able to hook into existing authentication and authorisation mechanisms. For us, that means LDAP - at least for authentication. On the other hand, we also have external people using our lists, so we need to be able to either put them into an SQL database which will work in conjunction with LDAP, or to add a separate LDAP tree for them, or something similar.

Something that I've mentioned before is the importance of preventing collateral spam. So, I'd like to be able to have my MTA ask Mailman, at SMTP time, whether a particular email address is permitted to post to a particular list. I'm using Exim, which could call an external python script, but I'd rather be able to issue an SMTP callout to a running daemon, for efficiency. The callout would be executed after each "RCPT TO".
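The per-recipient check described above could look roughly like the sketch below. This is only an illustration of the decision the daemon would make for each "RCPT TO"; the function name, the `posters` table and the reply texts are all made up here, and nothing in it is an existing Mailman or Exim interface.

```python
def rcpt_check(sender, list_address, posters):
    """Return an SMTP-style reply line for one RCPT TO.

    `posters` is a hypothetical dict mapping a lowercase list address to
    the set of lowercase addresses allowed to post to it.  It stands in
    for whatever Mailman would really consult (membership, moderation
    settings, and so on).
    """
    allowed = posters.get(list_address.lower())
    if allowed is None:
        # The recipient is not a list this daemon knows about.
        return "550 5.1.1 no such list"
    if sender.lower() in allowed:
        return "250 2.1.5 OK"
    # Refusing here lets the MTA reject during the SMTP dialogue
    # instead of accepting the mail and bouncing it later.
    return "550 5.7.1 sender not permitted to post to this list"


posters = {"staff@lists.example.org": {"alice@example.org"}}
print(rcpt_check("alice@example.org", "staff@lists.example.org", posters))
print(rcpt_check("mallory@example.net", "staff@lists.example.org", posters))
```

Because the refusal happens before the message is accepted, no bounce is generated towards a possibly forged sender, which is the collateral spam concern.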
-- Ian Eiloart IT Services, University of Sussex x3148 From thijs at debian.org Mon Jul 9 16:06:13 2007 From: thijs at debian.org (Thijs Kinkhorst) Date: Mon, 9 Jul 2007 16:06:13 +0200 Subject: [Mailman-Developers] Mailman roadmap In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> Message-ID: <200707091606.15858.thijs@debian.org> On Saturday 7 July 2007 18:35, Barry Warsaw wrote: > Mailman 2.1 is (shockingly) four and a half years old, having been > initially released on 30-Dec-2002. The last release in the series, > 2.1.9 was made almost a year ago. In the meantime, Mark and Tokio > have been doing a great job maintaining the 2.1 branch, with several > important patches in the tree now that will eventually become > 2.1.10. The problem of course is that we can't add any new features > to the 2.1 family , so we should be thinking about a new major > release. These sound like sensible plans and I'm curious about what 2.2 and 3.0 will bring. However, my question is whether we can expect some 2.1.x releases in the short term (like 2.1.10 you mentioned). As you say it will take quite some while for 2.2 to be released, and we'd like to get the fixed bugs in the 2.1.x branch to our users in the meantime. Regular 2.1.x releases with assorted fixes would be welcome to not scare users away from Mailman while we're waiting for the "big" releases. thanks, Thijs -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 481 bytes Desc: not available Url : http://mail.python.org/pipermail/mailman-developers/attachments/20070709/70764086/attachment.pgp From stephen at xemacs.org Tue Jul 10 05:09:39 2007 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Tue, 10 Jul 2007 12:09:39 +0900 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <87k5tgxg0j.fsf@athene.jamux.com> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5tgxg0j.fsf@athene.jamux.com> Message-ID: <87y7hpt0ng.fsf@uwakimon.sk.tsukuba.ac.jp> John A. Martin writes: > In the absence of a Message-ID > on an outgoing mail message many if not most MTAs will add one. Why > not let Mailman anticipate the need to add a Message-ID when archiving > the message rather than leaving it to the outgoing MTA? Quite. My reason for saying "last resort" is simply that this is not predictable to third parties. Eg, I send you (a non-subscriber) a message with CC and no Message-ID. You'd like to find the thread in the archives. You may as well just do a linear search on that month's threads. An URL based on an MD5 of the message body in theory would work, but in the presence of non-ASCII bodies, structured MIME, ML digests, and various MTA autoconversions, that seems fragile. From schwabe at upb.de Tue Jul 10 08:04:20 2007 From: schwabe at upb.de (Arne Schwabe) Date: Tue, 10 Jul 2007 08:04:20 +0200 Subject: [Mailman-Developers] Mailman roadmap In-Reply-To: References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> Message-ID: <46932164.3000007@upb.de> >> I plan on updating the wiki pages to reflect this thinking, but I >> would like to get feedback from y'all about the plan. It would be >> awesome if we could see a release of Mailman 2.2 some time in late >> 2007 or early 2008. >> >> Comments, question? >> > > All sounds very good. My two main problems with driving Mailman uptake here > on campus are both to do with usability. The first is the current web > interface, which was great when it was developed, but expectations have > moved on. The second is the lack of a unified user database. 
So, it's great
> to see these items listed as the main focus for 2.2 and 3.0 respectively.
>
> WRT 2.2, I'd like to be able to offer something as simple to use as the
> list management features of Google Groups (which I use for some voluntary
> groups that I work with), but with the ability to expose additional
> functionality on request.
>
>

At our university we developed a customized mini interface called the 'simple' interface. The normal Mailman interface is still there, called 'expert admin'. A (non-working) demo is here: https://lists.uni-paderborn.de/listadm/demo.html

The code does not use the Mailman template system, nor does it have multi-language abilities. It even includes code specific to our installation. (We have a membership class that maps users to users in LDAP and can create dynamic lists with users from LDAP plus static users.)

But maybe something like this should be included in future Mailman installations: either a static simple interface or even a customizable simple interface that is sufficient for 95% of people (with well-chosen defaults for your university/organisation).

> WRT 3.0, for enterprise and education purposes, it's important to be able
> to hook into existing authentication and authorisation mechanisms. For us,
> that means LDAP - at least for authentication. On the other hand, we also
> have external people using our lists, so we need to be able to either put
> them into an SQL database which will work in conjunction with LDAP, or to
> add a separate LDAP tree for them, or something similar.
>

This is possible with Mailman 2.1 with a self-written Mailman membership class, at least for list members. If someone really needs this I could look into polishing the code and making it public.

> Something that I've mentioned before, is the importance of preventing
> collateral spam. So, I'd like to be able to have my MTA ask Mailman whether
> a particular email address is permitted to post to a particular list, at
> SMTP time.
I'm using Exim, which could call an external python script, but > I'd rather be able to issue an SMTP callout to a running daemon, for > efficiency. The callout would be executed after each "RCPT TO". > > > Same for Email that get rejected for spam reasons would be neat Arne From barry at python.org Fri Jul 20 14:02:34 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 08:02:34 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 4, 2007, at 3:49 AM, Stephen J. Turnbull wrote: >> Along with that, I would really like to come up with an algorithm for >> calculating those urls without talking to the archiver. > > Brad didn't like this when I suggested it before, but I didn't really > understand why not. Anyway, FWIW: > > I suggest adding an X-List-Received-ID header to all messages. I > haven't really thought through whether the UUID in that field should > be at least partly human-readable or not, but that doesn't matter for > the basic idea.[1] The on-disk directory format would be > > /path-to-archive/private/my-list/Message-ID > > for singletons (Message-ID is the author-supplied ID) and > > /path-to-archive/private/my-list/Message-ID/List-Received-ID > > for multiples. These would be created on-the-fly when they occur. > They can be served as static pages. For almost all messages, the bare > URL > > http://archives.example.com/my-list/Message-ID > > should Just Work (ie, return a no-such-object result or a single > message). Where it does not, you get an index of all pages with that > message ID. I think this suggestion has merit, but I'm going to riff on it a bit. First, I want to avoid talking about file system layout. 
To me, that's an implementation detail we needn't worry about right now. Maybe the files will live on disk, maybe they'll live in a database, maybe they'll live in an external system we don't control. I don't care.

What I want is a uniform way to calculate an address for a message given nothing but its text, and an interface for retrieving messages from a service given that address. I'm thinking about this in a RESTful way, and it's perfectly legitimate for that 'message address' to be relative to some archive or message store root.

I've done some experiments. I took the top 5 mbox files on python.org and ran them through a script that looked for message-id collisions. Then I implemented 6 strategies for looking at whether the collisions were true collisions or duplicates. Duplicates are defined where every message in the same message-id bucket has the same match criteria, and collisions are where at least one message in the bucket is different. So for example, with strategy 2, if the message-id and date headers are the same for every message in the bucket, it's a dupe, otherwise it's a collision.

While I ran the script over each mbox separately, I think it's more interesting to talk about them as a whole collection. I don't really know how representative this would be of the world at large, but it's interesting anyway. FTR, the lists were mailman-users, python-dev, python-help, python-list, and tutor. I think there would be little intentional cross-posting between these lists.

Here are the numbers:

total 325146, missing: 624

1. msg.as_string(),
   dup: 34 (0.0104568409268%), col: 914 (0.281104488445%)
2. message-id + date,
   dup: 875 (0.269109876794%), col: 73 (0.0224514525782%)
3. message-id + 1st received,
   dup: 270 (0.0830396191249%), col: 678 (0.208521710247%)
4. message-id + all received,
   dup: 270 (0.0830396191249%), col: 678 (0.208521710247%)
5. message-id + date + 1st received,
   dup: 268 (0.0824245108351%), col: 680 (0.209136818537%)
6. body_line_iterator(msg),
   dup: 659 (0.202678181494%), col: 289 (0.0888831478782%)

Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%. While I'm almost tempted to ignore a hit rate that low, if you think of an archive holding 1B messages, you still get a lot of duplicates.

OTOH, the rate goes down even lower if you consider the message-id and date headers. (Note, I did not consider messages missing a date header). How likely is it that two messages with the same message-id and date are /not/ duplicates? Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.

I spent a /little/ time looking at the physical messages that ended up as true collisions. Though by no means did I look at them all, they all looked related. For example, with strategy 2 some messages look like they'd been inadvertently sent before they were completed. I need to see if there are any similarities in the MUAs behind these, but again, I think we might be able to safely assume that collisions on message-id+date can be ignored.

That leads me to the following proposal, which is just an elaboration on Stephen's. First, all messages live in the same namespace; they are not divided by target mailing list. Each message has two addresses, one is the Message-ID and one is the base32 of the sha1 hash of the Message-ID + Date. As Stephen proposes, Mailman would add these headers if an incoming message is missing them, and tough luck for the non-list copy. The nice thing is that RFC 2822 requires the Date header and states that Message-ID SHOULD be present.

Why the second address? First, it provides as close to a guaranteed unique identifier as we can expect, and second because it produces a nearly human readable format.
For example, Stephen's OP would have a second address of

    >>> mid
    '<87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp>'
    >>> date
    'Wed, 04 Jul 2007 16:49:58 +0900'
    >>> # XXX perhaps strip off angle brackets
    >>> h = hashlib.sha1(mid)
    >>> h.update(date)
    >>> base64.b32encode(h.digest())
    'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

I like base32 instead of base64 because the more limited alphabet should produce less ambiguous strings in certain fonts and I don't think the short b64 strings are short enough to justify the punctuation characters that would result. While RFC 3548 specifies the b32 alphabet as using uppercase characters, I think any service that accepts b32 ids should be case insensitive. A really Postel-y service could even accept '1' for 'I' and '0' for 'O' just to make it more resilient to human communication errors.

I'd like to come up with a good name for this second address, which would suggest the name of the X- header we stash this value in. X-B32-Message-ID isn't very sexy. Maybe X-Message-Global-ID, since I think there's a reasonable argument to make that for well-behaved messages, that's exactly what this is.

So now, think of the interface to a message store that supports this addressing scheme. Well it's something like:

    class MessageStore(Interface):
        def store_message(message):
            """Store the message.

            :raises ValueError: when the message is missing either the
                Message-ID header or a Date header.
            :raises DuplicateMessageError: when a message in the store
                already has a matching Message-ID and Date.  An archive
                is free to raise this exception for duplicate
                Message-IDs alone.
            """

        def get_message_by_global_id(key):
            """Locate and return the message from the store that matches `key`.

            :param key: The Global ID of the message to locate.  This is
                the base32 encoded SHA1 hash of the message's Message-ID
                and Date headers.
            :returns: The message object matching the Global ID, or None
                if there is no such match.
            """

        def get_messages_by_message_id(key):
            """Return the set of messages with a matching Message-ID `key`.

            :param key: The Message-ID of the messages to locate.
            :returns: The set of all messages in this store that have
                the given Message-ID.  If none such matches are found,
                the empty set is returned.
            """

As far as generating pages based on the Message-ID or global id, I agree with Stephen's proposal. A page returned in response to a message-id request could return the message page or it could return an index of such messages. It would be up to the archive whether it would accept duplicate Message-IDs or not, but it would always be guaranteed that a page returned in response to a global id request would return one email message.

Urls could be calculated by concatenating the List-Archive and X-Global-Message-ID headers, e.g.

http://mail.python.org/pipermail/mailman-developers/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be the OP. This could point to the same resource as

http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

and /might/ point to the same resource as:

http://mail.python.org/pipermail/mailman-developers/87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
http://mail.python.org/pipermail/mids/87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp

> A minor drawback to my proposal is that if a message gets archived as
> a singleton for that Message-ID, then a duplicate arrives, previously
> created references in the archive will of course now return an index
> rather than the desired message. Ie, there is data corruption. This
> can be dealt with in several ways; the easiest would be to provide a
> "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me"
> link when creating the directory for multiple instances.

Or by using the global id, or by rejecting messages with duplicate message ids.
> There's also a *very* minor benefit: repeat sends will be immediately
> recognizable without checking Message-ID.
>
> Footnotes:
> [1] By partly human-readable I mean containing list-id and date
> information. The idea would be to have the date come first, so that
> users would have a shot at identifying which of several messages is
> most likely, and this would be searchable by eye with simply an
> ordinary sorted index.

I see searching, indexing, sorting, and providing other human readable urls into the message store as a function of the archive. Once you're looking at a link to the actual message, you're going to be looking at a url that contains the global id, regardless of the number of levels you have to go through or redirects involved.

Apologies for letting this thread linger so long. I'm very interested in hearing your thoughts and if there's general agreement, I'll write it up in the wiki.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCkWnEjvBPtnXfVAQIRhAP7BkuF5K0xOuie2GBqOOWDarksD5Oy49y9
/2WO+u4xH+BttIt3adHJS+K6ETYcK79c5Rf4uwZk40DqWKK7ay1zkxUn/LGXOJ0o
CoWQG5ZyFUJUTkDXtxEWcZ8kkXaDTTSNz2eCtYgQAXw77A95E1SjV0YBs54bFK3A
Bi9cjrKRDcM=
=pyY6
-----END PGP SIGNATURE-----

From barry at python.org Fri Jul 20 14:19:57 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:19:57 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <468BD60A.2020709@Newfield.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <468A34B7.3080201@astro.princeton.edu> <13B9C232-8295-4533-B49B-205B901AA8E7@python.org> <468BD60A.2020709@Newfield.org>
Message-ID: <0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote:

> Barry Warsaw wrote:
>> Maybe a way to think about this is that the canonical url is based on
>> the message-id, but then there's some way to distill even this down
>> to a tinyurl or simple integer that
would be stable in the face of >> full archive regenerations. > > The resistance to basing this on message-id has always been that > there's > no guarantee of uniqueness... > ...but I believe each list has some sort of counter for how many > messages it's seen, so we could add another header with that > number, and > use as a unique id the two concatenated together... > (That way the archiver can know from the content of the header exactly > how to generate the same unique id as mailman, which would allow > for the > url-in-the-footer to happen w/o first hitting the archiver.) I'm not crazy about this idea for a couple of reasons. First, it means that someone who has a copy of the message that didn't come from the list (e.g. one of the two you will get of this message), cannot calculate this unique ID. Second, things can happen to a list that might cause this sequence number to get corrupted. Maybe a list will get deleted and then recreated. Maybe it will get moved and the sequence number will get reset in the move. Maybe the list will be upgraded to a new version of Mailman. I think we can do just as well by using Message-ID + Date and get very low collision rates. 
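As a rough illustration of the Message-ID + Date check, the sketch below groups raw messages by Message-ID and calls a bucket a duplicate when every Date in it matches and a collision otherwise. It is not the script used for the numbers earlier in the thread; the function name is made up, and it counts buckets, which may not be the unit that script counted.

```python
from collections import defaultdict
from email import message_from_string


def classify(raw_messages):
    """Count Message-ID buckets that are duplicates or collisions.

    A bucket holds the Date headers of all messages sharing one
    Message-ID.  Only buckets with two or more messages are counted.
    """
    buckets = defaultdict(list)
    missing = 0
    for raw in raw_messages:
        msg = message_from_string(raw)
        mid = msg["Message-ID"]
        if mid is None:
            # No Message-ID at all: nothing to key the bucket on.
            missing += 1
            continue
        buckets[mid].append(msg["Date"])
    dupes = collisions = 0
    for dates in buckets.values():
        if len(dates) < 2:
            continue
        if len(set(dates)) == 1:
            dupes += 1        # same Message-ID, same Date everywhere
        else:
            collisions += 1   # same Message-ID, at least one Date differs
    return {"missing": missing, "dupes": dupes, "collisions": collisions}
```

Unlike a per-list sequence number, both inputs here are author-supplied headers, so a copy that never went through the list yields the same answer.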
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqCobXEjvBPtnXfVAQIHFQP/Sz6WVqyFmo0lraw0hyyP5x4AhgBPDQmA /rFfSBRGbdORLXA2Ss0YdhI5cy8n7LMSsLawgtSt+JA7F5IEiC6Hk5C1M8C+Oe09 4ICYEuuL+gcXPPVc4aYtxp33HvPBFCzPJkGBS2PHaqCQkYIKdWHCtDZ8iLWCOxjc b674lsQk9tM= =a09C -----END PGP SIGNATURE----- From barry at python.org Fri Jul 20 14:27:54 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 08:27:54 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <468A34B7.3080201@astro.princeton.edu> <13B9C232-8295-4533-B49B-205B901AA8E7@python.org> Message-ID: <7B257080-6504-4804-84A1-1EC2F32EB5CB@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote: >> Maybe a way to think about this is that the canonical url is based on >> the message-id, but then there's some way to distill even this down >> to a tinyurl or simple integer that would be stable in the face of >> full archive regenerations. > > I'd suggest the reverse. Keep the canoncical archive URL short and > sweet, and then use a URL redirection service to map message-id's > to those URLs. It is the archiver's job to make it all work. For > example, > the canonical archive URL might stay exactly the way it is in > pipermail. > But the archival link embedded in the message would instead go > to a redirection service. I agree. My proposed global message id is exactly the canonical archive URL, although it's relative to the archiver's base url, as given in the List-Archive header. > http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html > http://mail.codeit.com/msgid?002701c4eb3d$07170ca0$3142003e at ADSL > > The one other thing I'd ike to revisit is integration with third party > archival services. 
There are two obvious integration points; one is a
> button in the Mailman list admin user interface that says "archive with
> service X" not unlike the setting in Firefox that basically says "search
> with service X".

I think we could define an interface that archive services would have to meet in order to be available to list admins. The site admin would of course have to enable them site-wide first. What kinds of information would be required?

- - List-Archive base url
- - Message injection procedure
- - Additional subscription procedures

The nice thing is that if my global id idea works, the injection process can be completely asynchronous.

> The other integration point is the archival link
> discussed above. In which case it would be set to something like.
>
> http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e at ADSL

All we'd need to know is the third party's List-Archive header value. The last part of the path would always be the global message id.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCqSnEjvBPtnXfVAQJq7gQArkmEb3DqrOaRTdYnQ0SCOrqWtiPxNJOd
555+JiHt/mEqPTuS/cF1GfdckwrQXbUJYWeO56dXzfbXtCVaW54h4k/95RI2/mqK
HR2BKcoVW/dDfYUd2V2Vbqdc7trVIy3oGdzQb24Pu9bIptqbdVSpnmx8jm9GIOi1
UAkJp+Ff5nc=
=lE32
-----END PGP SIGNATURE-----

From barry at python.org Fri Jul 20 14:39:34 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:39:34 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <1183651769.10813.6.camel@finch.boston.redhat.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 5, 2007, at 12:09 PM, John Dennis wrote:

> A little over a year ago I went on a search to find the best open source
> archiver and at that time I came up with Lurker
> (http://lurker.sourceforge.net) Since then I believe
Lurker has seen a > major new revision. I also believe Lurker is the archiver used by > Debian. > > So if you want to leverage existing open source archiving or at least > look at an example of what would be necessary to allow easy easy > external archiving integration with Mailman you might want to look at > Lurker. I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though. Lurker's GPL2 so that's fine. I'd be quite hesitant about shipping Mailman with Lurker because it's something we don't control and it's not Python. But I would be totally open to working with the Lurker developers on creating an easy bridge between the two systems. Perhaps this dovetails with Jeff's suggestion of easier integration with external archiving systems. Does anybody have contacts with the Lurker community that could cross- post a new thread to get the discussion going? (The same goes for any other archiver out there too.) - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqCtBnEjvBPtnXfVAQLgJwP9HNu/r/5YYAGn0HcQAhD8b8plDSpm2tao VcC7tROs0EyjRAQd1b3+hF102FMZzTXF/8LifgETN8K4MD9TXkxNhrTlKjmAUhLG 1tvHZT9oD73aLb81m2SuI3nbp8kQSMncPeMM4u1vGzpXfCYGK4chAPyIJ1Z5MNqj 6byAgVpwZEo= =qjmf -----END PGP SIGNATURE----- From barry at python.org Fri Jul 20 14:43:11 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 08:43:11 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> Message-ID: <94995768-85AC-4F96-8980-B26686B27426@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 8, 2007, at 1:06 AM, Paul Wise wrote: > My personal opinion is that pipermail should be removed and mailman > should not contain a default archiver since there are plenty of good > archivers already (lurker, mhonarc etc). Adding wrappers around them > would be simpler than reimplementing them. 
My hesitation about this has always been the turnkey question. Pipermail has its problems but it /does/ allow small sites to get going very quickly with a full(-ish) solution. It may be that most people get their Mailman installation from their distro or hosting service and this is no longer as important. In that case, I still wouldn't chuck Pipermail, but I would try to see if we can adopt Jeff's goal of making the archive selection pluggable and easily selectable by list admins.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCt4HEjvBPtnXfVAQJHQwP+P4KAQaA7uEeISQjFyb3zoMvOWwgoW3zH
taWsnVAhVmAF/hJBWDn7JtXwWiLw7ngCtGHp3MBKGBKzBjJP7ZizEMNfziaB+OoO
LOyF7sYB+KhKVi+Il7XnHYIjh6DSD8kullP+G/UNtuIsFnNs+aTntndfMKJG2Zct
E7M0F1Ok8FE=
=xXQJ
-----END PGP SIGNATURE-----

From barry at python.org Fri Jul 20 14:45:14 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 08:45:14 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <87y7hpt0ng.fsf@uwakimon.sk.tsukuba.ac.jp>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <87k5tgxg0j.fsf@athene.jamux.com> <87y7hpt0ng.fsf@uwakimon.sk.tsukuba.ac.jp>
Message-ID: <7194735A-3720-43FB-A1AF-35EEB7DAC271@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote:

> John A. Martin writes:
>
>> In the absence of a Message-ID
>> on an outgoing mail message many if not most MTAs will add one. Why
>> not let Mailman anticipate the need to add a Message-ID when archiving
>> the message rather than leaving it to the outgoing MTA?
>
> Quite.
>
> My reason for saying "last resort" is simply that this is not
> predictable to third parties. Eg, I send you (a non-subscriber) a
> message with CC and no Message-ID. You'd like to find the thread in
> the archives. You may as well just do a linear search on that month's
> threads.
Yep, and I say "tough". Let John complain to Stephen to fix his MTA to add those Message-IDs so Mailman doesn't have to. ;) > An URL based on an MD5 of the message body in theory would work, but > in the presence of non-ASCII bodies, structured MIME, ML digests, and > various MTA autoconversions, that seems fragile. Agreed, and it would do no better, in fact worse, than base32(sha1 (message-id + date)) - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqCuW3EjvBPtnXfVAQKx/AP9EUxDQmp1tiCEqJqVSFWeicq/9lThnMZN 58UUEPA47wPa1SJSk6z7+0vSfqTskwO1Frnn8OJ6X+MJAxCX4Hr86uBOnK9XW2AK byCfeYHBdapGlrsxmPd0so+FFJODWWRu7+yyKTw6ApDwVevatEEIMPlZkMALMv5S axC5ttHfR2E= =c0pw -----END PGP SIGNATURE----- From stephen at xemacs.org Fri Jul 20 15:21:27 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 20 Jul 2007 22:21:27 +0900 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> Message-ID: <87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > First, I want to avoid talking about file system layout. To me, > that's an implementation detail we needn't worry about right now. Agreed. > How likely is it that two messages with the same message-id and > date are /not/ duplicates? For message id generators that include a time-stamp in the generated id, approximately the same as the probability that two messages with the same message-id are not duplicates, no? > Heck, at that point, I'd feel justified in simply automatically > rejecting the duplicate and chucking it from the archive. I'd rather not go there. There may be applications for the archiver that require that all mail received be filed. 
Counterproposal: have a "collisions" namespace, and provide an interface for the list owner to decide what to do with them. They could be thrown away, they could be given an alternative global ID somehow and added (eg, the archive page could add a "See probable duplicates too" link), or they could be put into a moderation-like queue for list admins to decide about. > So now, think of the interface to a message store that supports this > addressing scheme. Well it's something like: I don't understand how the calling application is supposed to deal with a DuplicateMessageError exception since it should not change either the Message-ID or the Date if present. I see this as a major problem with any proposal to use only author headers in computing the "global id". > Or by using the global id, or by rejecting messages with duplicate > message ids. Er, the MTA has already accepted it. Do you plan to generate a list manager bounce to the poster? This has the unpleasant misfeature that it could be used to bounce spam off the list manager, since the poster needs to see content to determine whether this is a multiple send or actually the "intended version" after a "fat-finger" send; we already know the message-id isn't good enough. From stephen at xemacs.org Fri Jul 20 15:31:19 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Fri, 20 Jul 2007 22:31:19 +0900 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <468A34B7.3080201@astro.princeton.edu> <13B9C232-8295-4533-B49B-205B901AA8E7@python.org> <468BD60A.2020709@Newfield.org> <0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org> Message-ID: <87sl7jnqvs.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > Second, things can happen to a list > that might cause this sequence number to get corrupted. Add an X-Mailman-Sequence-Number header if not already present. 
That doesn't deal with your other comments, but as I point out elsewhere, if you don't use *any* Mailman-specific information in the global ID, you have no sane way to handle collisions except throw them away (or make the global ID refer to a collection resource, but that's kinda unintuitive). From barry at python.org Fri Jul 20 15:49:49 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 09:49:49 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <2EA1C28C-3C27-428C-9A4C-F09039B13A29@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote: >> How likely is it that two messages with the same message-id and >> date are /not/ duplicates? > > For message id generators that include a time-stamp in the generated > id, approximately the same as the probability that two messages with > the same message-id are not duplicates, no? Good point, though clearly not all message-ids have timestamp information in them. It does help explain why I see 600-odd more collisions when taking other data into account too. I've modified my script to sort collisions and dupes into maildir folders, so I'll take a closer look when that finishes running (it takes a long time to slog through all 5 mboxes, even on a fairly zippy dual-G5). >> Heck, at that point, I'd feel justified in simply automatically >> rejecting the duplicate and chucking it from the archive. > > I'd rather not go there. There may be applications for the archiver > that require that all mail received be filed. True. It would ultimately be an archiver policy though. 
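
[Editorial sketch, not part of the original message.] The idea that duplicate handling is "an archiver policy" can be illustrated with a small in-memory store covering the three outcomes Stephen lists for a colliding message: throw it away, file it alongside the original, or hold it for the list admins. The class and method names below (CollisionPolicy, MessageStore, add) are hypothetical and are not Mailman or archiver APIs. The point of the sketch is only that a store with such a policy never has to raise DuplicateMessageError.

```python
from enum import Enum


class CollisionPolicy(Enum):
    """Hypothetical policies for a message whose global id is already taken."""
    DISCARD = "discard"        # throw the later arrival away
    FILE_AS_DUPLICATE = "file" # keep it in a separate 'collisions' namespace
    MODERATE = "moderate"      # queue it for a list admin to decide


class MessageStore:
    """Toy in-memory store; a real archiver would persist all of this."""

    def __init__(self, policy):
        self.policy = policy
        self.messages = {}          # global id -> first message stored
        self.collisions = {}        # global id -> later arrivals, in order
        self.moderation_queue = []  # (global id, message) pairs awaiting review

    def add(self, gid, msg):
        """Store msg under gid and report what happened to it."""
        if gid not in self.messages:
            self.messages[gid] = msg
            return "stored"
        if self.policy is CollisionPolicy.DISCARD:
            return "discarded"
        if self.policy is CollisionPolicy.FILE_AS_DUPLICATE:
            self.collisions.setdefault(gid, []).append(msg)
            return "filed"
        self.moderation_queue.append((gid, msg))
        return "held"
```

With FILE_AS_DUPLICATE, an archive page could use the `collisions` entry for a global id to render a "See probable duplicates too" link.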
> Counterproposal: have a "collisions" namespace, and provide an > interface for the list owner to decide what to do with them. They > could be thrown away, they could be given an alternative global ID > somehow and added (eg, the archive page could add a "See probable > duplicates too" link), or they could be put into a moderation-like > queue for list admins to decide about. I like this. >> So now, think of the interface to a message store that supports this >> addressing scheme. Well it's something like: > > I don't understand how the calling application is supposed to deal > with a DuplicateMessageError exception since it should not change > either the Message-ID or the Date if present. > > I see this as a major problem with any proposal to use only author > headers in computing the "global id". Mailman would probably log and ignore DuplicateMessageErrors. It wouldn't be Mailman's responsibility to ensure the message gets archived, although I concede that as currently defined, you could end up with list copies that had a global id header that wasn't unique. OTOH, if the archiver implements a collision resolution policy such as a 'collisions' namespace, it wouldn't ever raise DuplicateMessageError. >> Or by using the global id, or by rejecting messages with duplicate >> message ids. > > Er, the MTA has already accepted it. Do you plan to generate a list > manager bounce to the poster? This has the unpleasant misfeature that > it could be used to bounce spam off the list manager, since the poster > needs to see content to determine whether this is a multiple send or > actually the "intended version" after a "fat-finger" send; we already > know the message-id isn't good enough. Yes, this wouldn't be an MTA bounce, it would be a Mailman bounce. But it would have to be subject to the same bounce rules as any other auto-response which could be used as a spam vector, e.g. 
limit the number of bounces per time period and don't include the entire original message in the bounce (as both can be, and are used as spam vectors). - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqC9fnEjvBPtnXfVAQLkEQQAhdu0BIvpRvTk92m9J/sbHVRSRxBGMqta Cm57WyRJGBxPV3xTE4ghVzXdDyIEvUjKimRTEWbeX60WqROL6FPsmAnwmsYbW3mw 8hqNXj+SpHP+1GIYnYgY9txiM75fHDa5T0VsjpcXAwtjeepHouXAEWbegBUrIzHt EBp5YCMqxv8= =5tjc -----END PGP SIGNATURE----- From nigel.metheringham at dev.intechnology.co.uk Fri Jul 20 15:17:26 2007 From: nigel.metheringham at dev.intechnology.co.uk (Nigel Metheringham) Date: Fri, 20 Jul 2007 14:17:26 +0100 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com> Message-ID: On 20 Jul 2007, at 13:39, Barry Warsaw wrote: > I've looked at a few lurker archivers and I wasn't blown away by its > user interface. That's apparently highly configurable though. I'd be inclined to agree wrt user interface. Documentation regarding this, and anything else to do with lurker, appears somewhat scarce - speaking as someone who has just migrated the exim.org lists to using lurker archiving. [previously we used mailman with the MHonArc/pipermail hybrid] I am considering starting a set of pages within our wiki about use of lurker (we tend to cover almost everything else about mail so why not that). > Lurker's GPL2 so that's fine. I'd be quite hesitant about shipping > Mailman with Lurker because it's something we don't control and > it's not Python. But I would be totally open to working with the > Lurker developers on creating an easy bridge between the two systems. > Perhaps this dovetails with Jeff's suggestion of easier integration > with external archiving systems. Integration with externals feels like a good way to go. 
> Does anybody have contacts with the Lurker community that could cross- > post a new thread to get the discussion going? The ML appears... lacking in vigor.. BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - ie 20070720, the second part is the receive time, UTC, as 6 digits, and the final part is some form of hex id. The nice part is if you quote just the first (or first 2) parts of message ID you get messages around that time... Nigel. -- [ Nigel Metheringham Nigel.Metheringham at InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ] From barry at python.org Fri Jul 20 16:07:48 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 10:07:48 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <87sl7jnqvs.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <468A34B7.3080201@astro.princeton.edu> <13B9C232-8295-4533-B49B-205B901AA8E7@python.org> <468BD60A.2020709@Newfield.org> <0FE1D8A1-DFBE-41D3-AE5F-CF0FCF26FB61@python.org> <87sl7jnqvs.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <1BC6A9D0-144B-4EE1-90C7-EEBF00396B22@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote: > Barry Warsaw writes: > >> Second, things can happen to a list >> that might cause this sequence number to get corrupted. > > Add an X-Mailman-Sequence-Number header if not already present. > > That doesn't deal with your other comments, but as I point out > elsewhere, if you don't use *any* Mailman-specific information in the > global ID, you have no sane way to handle collisions except throw them > away (or make the global ID refer to a collection resource, but that's > kinda unintuitive). I'd probably call it X-List-Sequence-Number and I'd have to ensure that archive copy had that header in it. 
OTOH, if I'm going to go to the trouble of adding this sequence number, why not just calculate a (more likely) gid for the message myself? If I did that, I could use a tinyurl scheme and get much shorter urls. The archiver would then be obliged to use my X-List-GID header verbatim.

I've been pushing for calculating this using non-Mailman headers because I'd /like/ for a client receiving the non-list copy to be able to make the same calculation.

OTOH, maybe we can have it both ways. So, we calculate the sequence number and generate the following headers:

    X-List-Sequence-Number: 801
    X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

The latter is composed of purely author generated data, the former is supplied by Mailman. Assuming we also had this header:

    List-Archive: http://archive.example.com/gid/

then the following urls would point to the same exact resource:

    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801

If however we subsequently got a collision, then these two urls would address different resources. E.g.:

    X-List-Sequence-Number: 2112
    X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

Now the two messages would still be addressable by their respective urls:

    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/2112

but

    http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be a disambiguation page. For a web u/i it would be an HTML list containing relative links to '801' and '2112'. A RESTful XML document would contain the set of links to the subordinate pages.

A client of the archive.example.com service would have to be prepared to handle disambiguation pages if it used only the author generated GID, but it would be guaranteed that the full url would lead directly to one and only one email message.
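
[Editorial sketch, not part of the original message.] The addressing scheme described above can be tried out in a few lines of modern Python. The hash follows the base32(sha1(message-id + date)) formula mentioned earlier in the thread, which yields a 32-character id like the sample GID. How the two header values are normalized before hashing is not settled here, so the plain string concatenation below is an assumption, and the Archive class with its add/resolve methods is a hypothetical stand-in for a real message store.

```python
import base64
import hashlib
from email.message import Message


def global_id(msg: Message) -> str:
    """base32(sha1(message-id + date)), using only author-supplied headers.

    Assumption: the raw header values are concatenated as-is.
    """
    message_id = msg.get("Message-ID", "")
    date = msg.get("Date", "")
    digest = hashlib.sha1((message_id + date).encode("utf-8")).digest()
    # A 20-byte SHA-1 digest encodes to exactly 32 base32 characters.
    return base64.b32encode(digest).decode("ascii")


class Archive:
    """Hypothetical in-memory archive keyed by GID, then sequence number."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/") + "/"
        self._buckets = {}  # gid -> {sequence number -> message}

    def add(self, msg: Message, sequence_number: int) -> str:
        """File msg and return its full, unambiguous url."""
        gid = global_id(msg)
        self._buckets.setdefault(gid, {})[sequence_number] = msg
        return f"{self.base_url}{gid}/{sequence_number}"

    def resolve(self, gid: str, sequence_number=None):
        """Return ('message', msg) or ('disambiguation', [sequence numbers])."""
        bucket = self._buckets[gid]
        if sequence_number is not None:
            return ("message", bucket[sequence_number])
        if len(bucket) == 1:
            return ("message", next(iter(bucket.values())))
        return ("disambiguation", sorted(bucket))
```

A client holding only the non-list copy can call global_id() itself and request the short url. It gets the message directly unless a collision has occurred, in which case it receives the list of sequence numbers to choose from.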
Archives would have to recognize the X-List-Sequence-Number and honor it whenever it regenerated its archives so that the urls would remain stable. Thinking about this more (and I've been up since about 3:30am so I'm a little foggy right now ;), we may want to optimize for fewer dupes rather than fewer collisions, or maybe it doesn't matter. It would be interesting to see how big the message-id buckets are when only using the Message-ID header. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqDBtHEjvBPtnXfVAQLOggQAhIjxlU2jPDb5K8Lfe3NThjgwKiPblqtm UurUj+AZCffS1ewGDlV6y3GGRnHEzdVSIVvAiATEGTRVG8Zzbbev3GXs0EKYiEyL FZreNcPqDAPL0KSGw73RdAiwZuszfQcMTsSwOx98zS9Kz0NtbntYQTuqQZwo7wAW 3KeGe2PkpaI= =yhaZ -----END PGP SIGNATURE----- From barry at python.org Fri Jul 20 16:26:59 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 10:26:59 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com> Message-ID: <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote: > > On 20 Jul 2007, at 13:39, Barry Warsaw wrote: >> I've looked at a few lurker archivers and I wasn't blown away by its >> user interface. That's apparently highly configurable though. > > I'd be inclined to agree wrt user interface. Documentation regarding > this, and anything else to do with lurker, appears somewhat scarce - > speaking as someone who has just migrated the exim.org lists to using > lurker archiving. [previously we used mailman with the MHonArc/ > pipermail > hybrid] I noticed that! There's no documentation link on the site. I also saw your question regarding getting a message out of lurker given its message-id. When I checked yesterday I didn't see a response. 
> I am considering starting a set of pages within our wiki about use of
> lurker (we tend to cover almost everything else about mail so why not
> that).

That would be cool. Feel free to add a link to your pages on the Mailman wiki, perhaps here:

http://wiki.list.org/display/DOC/Home

>> Does anybody have contacts with the Lurker community that could
>> cross-post a new thread to get the discussion going?
>
> The ML appears... lacking in vigor..
>
> BTW lurker gives all messages an ID which is 3 parts separated by
> periods. The first part is a date field - ie 20070720, the second part
> is the receive time, UTC, as 6 digits, and the final part is some form
> of hex id. The nice part is if you quote just the first (or first 2)
> parts of message ID you get messages around that time...

Obviously Mailman can't know the second and third parts so it can't use them in its list copies. I dislike using YYYYMMDD because of the high number of collisions.

I should make clear that what I'm really proposing is not specific to Mailman or any particular archiver. It's really an interface to a generic message store. We succeed by convincing other mailing list software and archivers to adopt the same standard so that they can interoperate seamlessly. We can perhaps have the first implementations of this de facto standard (any latent RFC shepherds out there? :). We get everyone else to adopt it when we take over the world.
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDGNHEjvBPtnXfVAQIwVQQAlwcmmuoXz/vKlpdu27wCHnfpwhhrQMmn
DWMEayuJsG+qg3GvkwyHGkgTBalENdDWWAQpPE9Zf9nmY24FyqhqRpe/QhOCajBV
4+lvXR1FARur4y4E9Lzcjz1TzX3lkaxx3dVCqpOtJxNVVvv442eYsLf11E3Z+wxY
m+ootMkR5pE=
=y4za
-----END PGP SIGNATURE-----

From nigel.metheringham at dev.intechnology.co.uk Fri Jul 20 16:38:56 2007
From: nigel.metheringham at dev.intechnology.co.uk (Nigel Metheringham)
Date: Fri, 20 Jul 2007 15:38:56 +0100
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <849198AE-DEC3-44C8-A090-470720624185@python.org>
 <1183651769.10813.6.camel@finch.boston.redhat.com>
 <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
Message-ID: <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>

On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
>> BTW lurker gives all messages an ID which is 3 parts separated by
>> periods. The first part is a date field - ie 20070720, the second
>> part is the receive time, UTC, as 6 digits, and the final part
>> is some form of hex id. The nice part is if you quote just the
>> first (or first 2) parts of message ID you get messages around that
>> time...
>
> Obviously Mailman can't know the second and third parts so it can't
> use them in its list copies. I dislike using YYYYMMDD because of the
> high number of collisions.

It's used as part of a UID, but has the nice feature of allowing easy queries as to other messages at that time.

If the archiver is local you also have the information for part 2 of the UID - lurker takes it from the From_ line.

Nigel.
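
[Editorial sketch, not part of the original message.] The three-part ID layout described above (date, six-digit UTC receive time, hex id, separated by periods) and the "quote just the first part(s)" lookup can be sketched as follows. This is based only on the description in this thread, not on Lurker's documentation or code. The names ArchiveId, parse_id and around, and the sample IDs used with them, are hypothetical.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class ArchiveId(NamedTuple):
    received: datetime  # date and receive time, interpreted as UTC
    suffix: str         # the trailing hex id, kept as text


def parse_id(message_id: str) -> ArchiveId:
    """Split 'YYYYMMDD.HHMMSS.hex' into its timestamp and hex suffix."""
    date_part, time_part, suffix = message_id.split(".")
    received = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    int(suffix, 16)  # raises ValueError if the last part is not hex
    return ArchiveId(received.replace(tzinfo=timezone.utc), suffix)


def around(ids: List[str], prefix: str) -> List[str]:
    """Return the ids whose leading part(s) equal prefix, e.g. one day."""
    return [i for i in ids if i == prefix or i.startswith(prefix + ".")]
```

For example, around(ids, "20070720") would select every message received on that date, and adding the time part narrows the match to a single second.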
--
[ Nigel Metheringham Nigel.Metheringham at InTechnology.co.uk ]
[ - Comments in this message are my own and not ITO opinion/policy - ]

From barry at python.org Fri Jul 20 16:52:17 2007
From: barry at python.org (Barry Warsaw)
Date: Fri, 20 Jul 2007 10:52:17 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <849198AE-DEC3-44C8-A090-470720624185@python.org>
 <1183651769.10813.6.camel@finch.boston.redhat.com>
 <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
 <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
Message-ID: <1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Nigel,

On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote:

> On 20 Jul 2007, at 15:26, Barry Warsaw wrote:
>>> BTW lurker gives all messages an ID which is 3 parts separated by
>>> periods. The first part is a date field - ie 20070720, the second
>>> part is the receive time, UTC, as 6 digits, and the final part
>>> is some form of hex id. The nice part is if you quote just the
>>> first (or first 2) parts of message ID you get messages around that
>>> time...
>>
>> Obviously Mailman can't know the second and third parts so it can't
>> use them in its list copies. I dislike using YYYYMMDD because of the
>> high number of collisions.
>
> It's used as part of a UID, but has the nice feature of allowing easy
> queries as to other messages at that time.

That should definitely be a way to traverse to the message, but it's not the message's global id (a.k.a. canonical address relative to the base url of the message store).
An archiver could provide other ways to traverse to the message, such as:

/barry at python.org/
    to see all messages by me

/barry at python.org/mailman-developers/20070720
    to see all messages by me today to this mailing list

/Subject?Improving%20the%20archives&sort=thread
    to find all the messages in this thread regardless of when they
    were posted

etc.

> If the archiver is local you also have the information for part 2
> of the UID - lurker takes it from the From_ line.

Mailman gets the From_ line before passing off to the archiver. But that's interesting, does lurker /require/ the From_ line?

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqDMInEjvBPtnXfVAQKJFAP/Y3FsBIXrSaRZ85eCl+pVTZxez2uRn0KB
2OMBV6vS/qC8K1R/myeGpBVr44yE/AfTa+kf+MLSlIlMpJdUlWDMWw2G90IPy1gv
t1VGrwbVPmOlLFxF8kIsi6NKIZpKoJrJVdQnSc+uPCqowIDU9FQ57+2hrH8HayTS
ISAZ0FTgAzk=
=sp+m
-----END PGP SIGNATURE-----

From nigel.metheringham at dev.intechnology.co.uk Fri Jul 20 16:59:03 2007
From: nigel.metheringham at dev.intechnology.co.uk (Nigel Metheringham)
Date: Fri, 20 Jul 2007 15:59:03 +0100
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <849198AE-DEC3-44C8-A090-470720624185@python.org>
 <1183651769.10813.6.camel@finch.boston.redhat.com>
 <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
 <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
 <1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>
Message-ID:

On 20 Jul 2007, at 15:52, Barry Warsaw wrote:
> Mailman gets the From_ line before passing off to the archiver.
> But that's interesting, does lurker /require/ the From_ line?

Well lurker handles Maildir - no From_ but the same info is in the filename, and it can take messages on stdin without a From_ - at which point I guess it's either faking it (from the headers) or making things up.

Nigel.
-- [ Nigel Metheringham Nigel.Metheringham at InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ] From barry at python.org Fri Jul 20 17:16:19 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 11:16:19 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com> <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org> <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk> <1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote: > On 20 Jul 2007, at 15:52, Barry Warsaw wrote: >> Mailman gets the From_ line before passing off to the archiver. >> But that's interesting, does lurker /require/ the From_ line? >> > > Well lurker handles Maildir - no From_ but the same info is in the > filename, and it can take messages on stdin without a From_ - at > which point I guess its either faking it (from the headers) or making > things up. Cool. I wonder if lurker is compatible with Python 2.5's mailbox.Maildir implementation and whether the two could share the maildirs. Thanks for the information! - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqDRw3EjvBPtnXfVAQJHXwP/SiKhWiZ57thW84RBUWt9QVjf4KISEfRJ H5lioRVPYYegiJp7rf/08TutkNsxGCHzRd/cdMEFXMkrCAdifLQ2QIdS4LRvEKyY eRbVHcmxyAlwMbyUq36W+pcH2MutTM64HKNrbL9YRSTaLyMA11FnmaiGIK3RMnbM AqtLGRSJ8Ec= =D8oM -----END PGP SIGNATURE----- From stephen at xemacs.org Fri Jul 20 19:21:54 2007 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Sat, 21 Jul 2007 02:21:54 +0900 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <2EA1C28C-3C27-428C-9A4C-F09039B13A29@python.org> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <87tzrznrc8.fsf@uwakimon.sk.tsukuba.ac.jp> <2EA1C28C-3C27-428C-9A4C-F09039B13A29@python.org> Message-ID: <87odi7ng7h.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > But it would have to be subject to the same bounce rules as any other > auto-response which could be used as a spam vector, e.g. limit the > number of bounces per time period and don't include the entire > original message in the bounce But that prevents detecting a prematurely sent message, which is presumably a common use case for genuine collisions. I just don't think bouncing back is going to be very useful; either you don't give the user the information he needs to figure out what happened, or you give the spammers a vector. From Dale at Newfield.org Fri Jul 20 01:10:25 2007 From: Dale at Newfield.org (Dale Newfield) Date: Thu, 19 Jul 2007 19:10:25 -0400 Subject: [Mailman-Developers] Potential solution to protecting email addresses in archive... Message-ID: <469FEF61.7050204@Newfield.org> http://mailhide.recaptcha.net/ -Dale D...@Newfield.org From barry at python.org Fri Jul 20 22:52:41 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 16:52:41 -0400 Subject: [Mailman-Developers] Mailman roadmap In-Reply-To: References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> Message-ID: <0C108CCF-5DF9-4E1F-A88D-39C91757FBA7@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 It's catch up on email day! On Jul 9, 2007, at 9:35 AM, Ian Eiloart wrote: > WRT 3.0, for enterprise and education purposes, it's important to > be able > to hook into existing authentication and authorisation mechanisms. 
> For us, > that means LDAP - at least for authentication. On the other hand, > we also > have external people using our lists, so we need to be able to > either put > them into an SQL database which will work in conjunction with LDAP, > or to > add a separate LDAP tree for them, or something similar. So, my standard answer to this will be, Mailman will provide interfaces for these things and ensure that the application is written to only ask these questions through the interface. It will be easy to plug in different backends, so if someone were to write an LDAP or hybrid backend, Mailman would work with it. This will be much easier than the current hacks required for Mailman 2.1. The goal would be to support such plugins through Python eggs, so that such extensions are easy to install and enable. > Something that I've mentioned before, is the importance of preventing > collateral spam. So, I'd like to be able to have my MTA ask Mailman > whether > a particular email address is permitted to post to a particular > list, at > SMTP time. I'm using Exim, which could call an external python > script, but > I'd rather be able to issue an SMTP callout to a running daemon, for > efficiency. The callout would be executed after each "RCPT TO". Yes, you'll have this capability. Really, you could do it now with a bit of Python coding. I'm currently jonesing for a RESTful interface to Mailman 3 for controlling it, asking it questions, etc., which of course the site-administrator would have to enable. The only other viable alternatives I see are XMLRPC, or the current 'write some custom Python script' approach. 
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqEgmXEjvBPtnXfVAQIp7gP/Xh2SnvuaQScrZdZx2YvCQfPA4IwpdjMN 3yEEw3BOVtvcVM2/mTXel9ZFMx0I9sEvi5U+Fk0E5Bk8/KQ/Nr9Y7SxWxx3mF1UE ssmLVeNt1k5OziufLwATQcsqAV47YNdj1vcJnhlPuq5k+LMgNDA0XLG2dqFgJ/z5 p1dSCJ96HZ4= =7TmC -----END PGP SIGNATURE----- From barry at python.org Fri Jul 20 22:56:17 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 16:56:17 -0400 Subject: [Mailman-Developers] Mailman roadmap In-Reply-To: <200707091606.15858.thijs@debian.org> References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> <200707091606.15858.thijs@debian.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 9, 2007, at 10:06 AM, Thijs Kinkhorst wrote: > These sound like sensible plans and I'm curious about what 2.2 and > 3.0 will > bring. However, my question is whether we can expect some 2.1.x > releases in > the short term (like 2.1.10 you mentioned). As you say it will take > quite > some while for 2.2 to be released, and we'd like to get the fixed > bugs in the > 2.1.x branch to our users in the meantime. > > Regular 2.1.x releases with assorted fixes would be welcome to not > scare users > away from Mailman while we're waiting for the "big" releases. Agreed. Mark and Tokio will decide when 2.1.10 is ready, and I suspect that we will have perhaps a few more point releases before 2.2 and 3.0 are out. I would like the upgrade from 2.1.x to 2.2 to be as easy as 2.1.x to 2.1.x+1. The upgrade to 3.0 will likely require running an 'export' command in 2.x followed by an 'import' command in 3.0. The really tricky parts will be resolving conflicts when merging users. I haven't thought this far ahead, but it may be that there's an intervening conflict resolution step, or resolution strategies in the 'import' command. 
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqEhcXEjvBPtnXfVAQIKTAP/dpA6VkHY5qkdV9dx6YRQMEYHXuDfcqly vhSzJ1tJXDIoXkCYa4uMcDTbyFKM3M+ytWR3LbcckGpsApikKgIG8rJz+ik3qIZc rlm8c4fuevuP9M+uw3S4Z9xK8mxpEBaVvn3rfVywSq9dm4C5zJO0meQMPRz8IPRj T/J1z613xdI= =ToXV -----END PGP SIGNATURE----- From barry at python.org Fri Jul 20 22:59:48 2007 From: barry at python.org (Barry Warsaw) Date: Fri, 20 Jul 2007 16:59:48 -0400 Subject: [Mailman-Developers] Mailman roadmap In-Reply-To: <4692C472.1020201@uni-paderborn.de> References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org> <4692C472.1020201@uni-paderborn.de> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 9, 2007, at 7:27 PM, Arne Schwabe wrote: > At our University we developed a customized mini Interface called > 'simple' Interface. The normal mailman Interface is still there, > called > 'expert admin'. A (non working) demo is here: > https://lists.uni-paderborn.de/listadm/demo.html The code does not use > the mailman template system nor does it have multi language abilities. > It even includes code specific to our installation. (We have a > membership class that maps users to user in ldap and can create > dynamic > list with users from ldap + static users) > > But maybe something like this should be included in future Mailman > installation. Either a static simple interface or even a customizable > simpe interface that is sufficent for 95% of the people (with well > chosen defaults for your university/organisation) This is exactly the kind of thing that would be nice to have. I'd also like for 3.0 to have a 'simple' and 'expert' list admin interface. BTW, your page makes me wish either 1) I spoke German :) or 2) Google Translate could handle your URL! 
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqEiRHEjvBPtnXfVAQIfLwQAr25h5wkRKWtvOAXOVfONnFlQEcyG1rYb tTomQyD5FlSAZO6MeoPEW04QtwWjv9Q3tpxFlf4tRqaPmkdb6pd6WgLy7xkVlSn5 IlTSQdnehB2CNHOKHNMiXHrl45OxxDU9PDPPYyMhaZDNPzaEfT4ad8xay3ktn2cc 2zmW/oLziUo= =HJOq -----END PGP SIGNATURE----- From barry at python.org Sat Jul 21 22:25:07 2007 From: barry at python.org (Barry Warsaw) Date: Sat, 21 Jul 2007 16:25:07 -0400 Subject: [Mailman-Developers] Major updates to Mailman 3.0 branch Message-ID: <61705022-9B57-47A5-BC3C-BCE160EBD80A@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I've just merged in my 'setuptools' branch to the official Mailman 3 branch. This means that all the autoconf-based building cruft is gone, dead, and buried. Can you say "Yay"? :) Instead, Mailman is now a setuptools based project, which is rapidly becoming the standard for Python applications and libraries. There are lots and lots of benefits to this change; for developers, the most immediate benefit is that there's no more configure/make/ makeinstall dance for every little change. It means you can build a 'development' installation of Mailman and edit and test right in place. This will hugely reduce the overhead for developing the code. Also gone are the C wrapper programs, so Mailman is now a pure Python application. With the change to wsgi-based web integration, and maildir delivery from the MTA, we really don't need them any more. You'll notice a bunch of other things have disappeared too, like the 3rd party packages we were distributing in the misc/ directory. Instead, these are downloaded on demand from the Python Cheeseshop . All the packages Mailman depends on live in the Cheeseshop so we don't need to distribute them any more. And now that we're a setuptools-based project, when you build Mailman (see below), these dependent packages will be automatically downloaded and installed as necessary. 
You'll need a net connection for the initial build, but after that, once the packages are installed, you're good to go.

To prepare your existing branch for the update, start by doing a 'make distclean' followed by a 'bzr revert'. IMPORTANT - don't do this if you have uncommitted changes! 'bzr stat' should now report no changes. Next, cd into your 'misc' directory and remove the following directories:

- - Elixir-0.3.0
- - SQLAlchemy-0.3.3
- - setuptools-0.6c3
- - zope.interface-3.3.0.1

These are the unpacked dependent packages and you don't need them any more (in fact, they'll get in your way now). Next, remove any residual .pyc files laying around, via:

% find . -name \*.pyc -print | xargs rm

Now do your 'bzr pull' to get the latest 3.0 branch changes. If you have local modifications, you'll need to do a 'bzr merge' and resolve any conflicts.

To see if everything's cool, pick a 'development' directory. I usually use a subdirectory called 'staging' in my 3.0 working tree. This development directory can be anything, but then do this:

% export PYTHONPATH=

for me "" would be `pwd`/staging. Make sure this directory exists. Next do this:

% python2.5 setup.py develop --install-dir

After churning for a bit, downloading some stuff, etc., look in your staging directory. You'll have a bin directory there with all the Mailman command line scripts, but all the code will remain in your working tree. You can now do this to run the test suite:

% /bin/testall

You should see 72 tests passing and no failures.

Though you probably won't get very far, the way this is going to work in deployed installations is that you'll have to run bin/make_instance to create some directories and do a few other things that the configure and Makefile.in's used to do. After running that, you'll have an etc/mailman.cfg file that you can tweak for your installation. Note that you don't need to do this for bin/testall because the testing infrastructure creates a temporary instance that it uses.
Let me know if you have any problems.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqJro3EjvBPtnXfVAQL0OgQAktryl9zYZlZrm9dC644EL2hF+FMHs50v
2JTh2w6UhULAdpj+xneKR04pEfCw6HHNult03jo6NoNjYjMmzpUycDHXj7e0RaKE
mbhMkX2v2b6d01OsMrAbAyWBTZH+2rtmjEKANd/1/+LlIdy3KlJe09m+xgNY9VsV
keEIEdVQcYc=
=J3Mq
-----END PGP SIGNATURE-----

From amk at amk.ca Sat Jul 21 23:16:11 2007
From: amk at amk.ca (A.M. Kuchling)
Date: Sat, 21 Jul 2007 17:16:11 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To:
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
 <849198AE-DEC3-44C8-A090-470720624185@python.org>
 <1183651769.10813.6.camel@finch.boston.redhat.com>
 <1301AF8F-CB62-4CC1-B792-CF4A898F0AAC@python.org>
 <0E448BC5-46D0-433B-89A5-32BC93659D72@dev.intechnology.co.uk>
 <1139EDFB-2D9B-41F1-9DCE-670D82389448@python.org>
Message-ID: <20070721211611.GA26947@andrew-kuchlings-computer.local>

On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote:
> Cool. I wonder if lurker is compatible with Python 2.5's
> mailbox.Maildir implementation and whether the two could share the
> maildirs. Thanks for the information!

It had better be -- Maildir has a published specification. If there's an incompatibility, that would be a bug in either mailbox.py or lurker.

--amk

From fil at rezo.net Sun Jul 22 11:18:03 2007
From: fil at rezo.net (Fil)
Date: Sun, 22 Jul 2007 11:18:03 +0200
Subject: [Mailman-Developers] Potential solution to protecting email addresses in archive...
In-Reply-To: <469FEF61.7050204@Newfield.org> References: <469FEF61.7050204@Newfield.org> Message-ID: > http://mailhide.recaptcha.net/ excellent > D href="http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=" > onclick="window.open('http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=','','toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); > return false;" title="Reveal this e-mail address">...@Newfield.org window.open(this.href) will do :-) -- Fil From barry at python.org Sun Jul 22 15:26:59 2007 From: barry at python.org (Barry Warsaw) Date: Sun, 22 Jul 2007 09:26:59 -0400 Subject: [Mailman-Developers] Potential solution to protecting email addresses in archive... In-Reply-To: References: <469FEF61.7050204@Newfield.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 22, 2007, at 5:18 AM, Fil wrote: >> http://mailhide.recaptcha.net/ > > excellent I like this one better: http://www.thehumorarchives.com/joke/Best_Captcha_Ever :) - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqNbJHEjvBPtnXfVAQLoUQQAjIAt/75iK0F0lViDLAFwaeb25H5INKCY kIs/jt4shtEmXpXbW81JXHD4reaDcB8UOnv9cavtorPaOXaIaGTds/m4yUdqjlli yKA9LLTEd0ys6LJhuwh774m2XpPLpi/V6i6owf8ojTtW/pm8C62G2/Zlvo8wq10p 8CsZxlkVaq8= =l7LJ -----END PGP SIGNATURE----- From terri at zone12.com Sun Jul 22 18:33:04 2007 From: terri at zone12.com (Terri Oda) Date: Sun, 22 Jul 2007 12:33:04 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com> Message-ID: <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com> On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote: > I've looked at a few lurker archivers and I wasn't blown away by its > user interface. 
> That's apparently highly configurable though.

I've been doing a lot of thinking about interface, and I'm coming to the conclusion that something more like a web bulletin board is probably the way to go, given that people use them all the time without much trouble and with a fairly minimal amount of whining. ;)

I'm trying to use interfaces to things like comment systems (which are often threaded -- picture the slashdot stuff, maybe?) and popular boards like phpbb (which isn't threaded beyond separate topics) as guides to how people usually deal with conversations on the web.

It'd actually be fairly easy, at that point, to just put a posting interface into the archives (yes, you'd have to be logged in, and yes, this means your password becomes that bit more valuable because someone having it can pose as you to the list... but they could do that by spoofing your email address so I'm not too concerned). But then people who don't like email or just want to pop by and check the list quickly could actually use mailman like a web board, which is something I'm pretty sure would get used (I know my users have asked for it in the past).

I've been drafting simple prototype interfaces in my head, trying to keep potential architectures in mind. I'm hoping I'll have time this week to code up some HTML and see how well they actually work when they're not just inside my head. :)

Terri

From Dale at Newfield.org Sun Jul 22 22:40:04 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Sun, 22 Jul 2007 16:40:04 -0400
Subject: [Mailman-Developers] Potential solution to protecting email addresses in archive...
In-Reply-To: References: <469FEF61.7050204@Newfield.org> Message-ID: <46A3C0A4.4070302@Newfield.org> Fil wrote: >> D> href="http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=" >> >> onclick="window.open('http://mailhide.recaptcha.net/d?k=01Qtvu7BFKxAunezLXAq0QPA==&c=QjjpEgddAt0UK7mq_dl1B-AnlzQr8HHSAY7jwMSGwJ0=','','toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); >> >> return false;" title="Reveal this e-mail address">...@Newfield.org > > window.open(this.href) will do :-) Yeah--that gunk is what they suggest as a replacement, but not what I ended up using. Just the url is sufficient (albeit long). (Since I'm depending upon outside resources to make this work, why not rely on *both* tinyurl.com *and* recaptcha.net ? :-) -Dale Newfield http://tinyurl.com/2r49tj From Dale at Newfield.org Sun Jul 22 23:17:32 2007 From: Dale at Newfield.org (Dale Newfield) Date: Sun, 22 Jul 2007 17:17:32 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com> <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com> Message-ID: <46A3C96C.3060106@Newfield.org> Terri Oda wrote: > I've been doing a lot of thinking about interface, and I'm coming to > the conclusion that something more like a web bulletin board is > probably the way to go For public lists, the answer may lie in external tools like nabble.com or mailinglistarchive.com Of course, that doesn't help for lists wishing to keep their content private. 
-Dale

From jag at fsf.org Mon Jul 23 15:32:44 2007
From: jag at fsf.org (Joshua 'jag' Ginsberg)
Date: Mon, 23 Jul 2007 09:32:44 -0400
Subject: [Mailman-Developers] Mailman roadmap
In-Reply-To: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
References: <3123AE21-CE74-4A62-AA1E-E4CB89B92C0C@python.org>
Message-ID: <46A4ADFC.9020209@fsf.org>

I apologize for the late chiming in here. I'd like to propose the XMLRPC extension for Mailman 2.2 that has been developed over the last two years. I have in the last couple months updated the patch to 2.1.9 and its code impact is really quite minimal; it's really quite standalone. And for those looking for ways to interact with the Mailman database from other applications, it provides the necessary interfaces for administration and moderation functions.

Thoughts?

-jag

Barry Warsaw wrote:
> Now that we've successfully navigated the switch to Bazaar, it's time
> to lay out plans for future Mailman releases. I've talked to several
> people about what to do about Mailman's future and I'd like to take
> this opportunity to describe my thoughts and get your feedback.
> First some background.
>
> Mailman 2.1 is (shockingly) four and a half years old, having been
> initially released on 30-Dec-2002. The last release in the series,
> 2.1.9 was made almost a year ago. In the meantime, Mark and Tokio
> have been doing a great job maintaining the 2.1 branch, with several
> important patches in the tree now that will eventually become
> 2.1.10. The problem of course is that we can't add any new features
> to the 2.1 family, so we should be thinking about a new major
> release.
>
> I've been making good progress on the SQLAlchemy/Elixir version, which
> will finally get rid of pickles and put Mailman on a Real Database
> (tm). It's been clear to me for a while that this branch will have a
> unified user database. It simply makes no sense to build the
> database back-end without once and for all fixing this design
> constraint.
I've always said that the unified user database will be > in Mailman 3, and thus this branch is indeed called "Mailman 3.0". > > I've been slowly building things back up from the ground floor. The > basic data model is in pretty good shape and I'm taking a religious > test-driven approach to making things work again. But the branch > still needs a lot of work, and I have no ETA for Mailman 3.0. > > In the meantime, Andrew Kuchling and others have volunteered to work > on modernizing the Mailman web u/i, and Terri recently started a > thread discussing updates to the archiver. I think it makes sense to > bless these efforts, towards the goal of releasing them in Mailman > 2.2. I intend to create an official Mailman 2.2 branch in bzr where > these efforts can land as they mature. My hope of course is that > we'll also be able to use much of this new code for Mailman 3. > > I'd like to keep the changes for 2.2 focused on the web u/i and > archiver, with a small number of additional features to be > determined. Mailman 2.2 should see no changes to the basic > architecture or 'database'; we'll continue to use pickles by default > for Mailman 2.2. While I won't rule out other new features, I want > to be very picky about those that are accepted for 2.2, and would not > feel bad at all if we rejected or deferred until 3.0 most of those > proposed. Criteria for other 2.2 features must include minimal code > impact with a high degree of reliability and stability. > > I plan on updating the wiki pages to reflect this thinking, but I > would like to get feedback from y'all about the plan. It would be > awesome if we could see a release of Mailman 2.2 some time in late > 2007 or early 2008. > > Comments, question? 
>
> -Barry
>
_______________________________________________
Mailman-Developers mailing list
Mailman-Developers at python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py
Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/jag%40fsf.org
Security Policy: http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq01.027.htp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : http://mail.python.org/pipermail/mailman-developers/attachments/20070723/35001c17/attachment.pgp

From jeff at jab.org Tue Jul 24 08:02:46 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Mon, 23 Jul 2007 23:02:46 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
Message-ID:

> Notice that of 325146 total messages, 624 of them had no message-id
> header. Even if you aggregate dup+col, you're still looking at a
> total duplicate rate of 0.29%.

Message ID's are supposed to be unique. This is discussed in RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places. If that's not the case, the mail transfer agent is broken. I think it's better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier. This is a cost/benefit thing; the cost is some real world collisions, the benefit is a conceptually simpler system. Conceptually simpler things are good especially when implemented all over the place.
Which brings me to suggestion #2, which is to go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much as something mailman specific. Why not come up with a scheme usable by all list servers? And also describe a specification third party archival services can comply with. Besides, I've always wanted to help write an RFC. If we go that route, it would be good to get input from a range of people - one person I'd suggest is Earl Hood, author of mhonarc.

Thoughts?

Jeff

> While I'm almost tempted to ignore a
> hit rate that low, if you think of an archive holding 1B messages,
> you still get a lot of duplicates.
>
> OTOH, the rate goes down even lower if you consider the message-id
> and date headers. (Note, I did not consider messages missing a date
> header). How likely is it that two messages with the same message-id
> and date are /not/ duplicates? Heck, at that point, I'd feel
> justified in simply automatically rejecting the duplicate and
> chucking it from the archive.
>
> I spent a /little/ time looking at the physical messages that ended
> up as true collisions. Though by no means did I look at them all,
> they all looked related. For example, with strategy 2 some messages
> look like they'd been inadvertently sent before they were completed.
> I need to see if there's any similarities in MUA behind these, but
> again, I think we might be able to safely assume that collisions on
> message-id+date can be ignored.
>
> That leads me to the following proposal, which is just an elaboration
> on Stephen's. First, all messages live in the same namespace; they
> are not divided by target mailing list. Each message has two
> addresses, one is the Message-ID and one is the base32 of the sha1
> hash of the Message-ID + Date. As Stephen proposes, Mailman would
> add these headers if an incoming message is missing them, and tough
> luck for the non-list copy.
> The nice thing is that RFC 2822 requires
> the Date header and states that Message-ID SHOULD be present.
>
> Why the second address? First, it provides as close to a guaranteed
> unique identifier as we can expect, and second because it produces a
> nearly human readable format. For example, Stephen's OP would have a
> second address of
>
> >>> mid
> '<87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp>'
> >>> date
> 'Wed, 04 Jul 2007 16:49:58 +0900'
> >>> # XXX perhaps strip off angle brackets
> >>> h = hashlib.sha1(mid)
> >>> h.update(date)
> >>> base64.b32encode(h.digest())
> 'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'
>
> I like base32 instead of base64 because the more limited alphabet
> should produce less ambiguous strings in certain fonts and I don't
> think the short b64 strings are short enough to justify the
> punctuation characters that would result. While RFC 3548 specifies
> the b32 alphabet as using uppercase characters, I think any service
> that accepts b32 ids should be case insensitive. A really Postel-y
> service could even accept '1' for 'I' and '0' for 'O' just to make it
> more resilient to human communication errors.
>
> I'd like to come up with a good name for this second address, which
> would suggest the name of the X- header we stash this value in.
> X-B32-Message-ID isn't very sexy. Maybe X-Message-Global-ID, since I
> think there's a reasonable argument to make that for well-behaved
> messages, that's exactly what this is.
>
> So now, think of the interface to a message store that supports this
> addressing scheme. Well it's something like:
>
> class MessageStore(Interface):
>     def store_message(message):
>         """Store the message.
>
>         :raises ValueError: when the message is missing either the
>             Message-ID header or a Date header.
>         :raises DuplicateMessageError: when a message in the store
>             already has a matching Message-ID and Date. An archive is
>             free to raise this exception for duplicate Message-IDs alone.
>         """
>
>     def get_message_by_global_id(key):
>         """Locate and return the message from the store that matches `key`.
>
>         :param key: The Global ID of the message to locate. This is the
>             base32 encoded SHA1 hash of the message's Message-ID and
>             Date headers.
>         :returns: The message object matching the Global ID, or None if
>             there is no such match.
>         """
>
>     def get_messages_by_message_id(key):
>         """Return the set of messages with a matching Message-ID `key`.
>
>         :param key: The Message-ID of the messages to locate.
>         :returns: The set of all messages in this store that have the
>             given Message-ID. If none such matches are found, the empty
>             set is returned.
>         """
>
> As far as generating pages based on the Message-ID or global id, I
> agree with Stephen's proposal. A page returned in response to a
> message-id request could return the message page or it could return
> an index of such messages. It would be up to the archive whether it
> would accept duplicate Message-IDs or not, but it would always be
> guaranteed that a page returned in response to a global id request
> would return one email message.
>
> Urls could be calculated by concatenating the List-Archive and
> X-Global-Message-ID headers, e.g.
>
> http://mail.python.org/pipermail/mailman-developers/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
>
> would be the OP.
> This could point to the same resource as
>
> http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
> http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
>
> and /might/ point to the same resource as:
>
> http://mail.python.org/pipermail/mailman-developers/87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
> http://mail.python.org/pipermail/mids/87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
>
> > A minor drawback to my proposal is that if a message gets archived as
> > a singleton for that Message-ID, then a duplicate arrives, previously
> > created references in the archive will of course now return an index
> > rather than the desired message. Ie, there is data corruption. This
> > can be dealt with in several ways; the easiest would be to provide a
> > "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me"
> > link when creating the directory for multiple instances.
>
> Or by using the global id, or by rejecting messages with duplicate
> message ids.
>
> > There's also a *very* minor benefit: repeat sends will be immediately
> > recognizable without checking Message-ID.
> >
> > Footnotes:
> > [1] By partly human-readable I mean containing list-id and date
> > information. The idea would be to have the date come first, so that
> > users would have a shot at identifying which of several messages is
> > most likely, and this would be searchable by eye with simply an
> > ordinary sorted index.
>
> I see searching, indexing, sorting, and providing other human
> readable urls into the message store as a function of the archive.
> Once you're looking at a link to the actual message, you're going to
> be looking at a url that contains the global id, regardless of the
> number of levels you have to go through or redirects involved.
>
> Apologies for letting this thread linger so long. I'm very
> interested in hearing your thoughts and if there's general
> agreement, I'll write it up in the wiki.
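To make the quoted proposal a little more concrete, here is a minimal sketch of the global ID computation and a dict-backed message store. Several things are assumptions rather than anything stated in the thread: the names `global_id`, `normalize_global_id` and `InMemoryMessageStore` are hypothetical, headers are encoded as UTF-8 before hashing (the quoted session hashed Python 2 byte strings directly), angle brackets are left on the Message-ID (the quoted session flags that as an open question), and the Message-ID lookup returns a list instead of a set.

```python
import base64
import hashlib
from email.message import Message


class DuplicateMessageError(Exception):
    """Raised when the same Message-ID and Date are stored twice."""


def global_id(message_id, date):
    """Base32 of the SHA1 of Message-ID followed by Date."""
    h = hashlib.sha1(message_id.encode("utf-8"))  # assumption: UTF-8
    h.update(date.encode("utf-8"))
    return base64.b32encode(h.digest()).decode("ascii")


def normalize_global_id(text):
    """Be lenient: accept lowercase, '1' for 'I' and '0' for 'O'."""
    return text.strip().upper().replace("1", "I").replace("0", "O")


class InMemoryMessageStore:
    """Hypothetical in-memory stand-in for the quoted MessageStore."""

    def __init__(self):
        self._by_global_id = {}
        self._by_message_id = {}

    def store_message(self, message):
        mid, date = message["Message-ID"], message["Date"]
        if mid is None or date is None:
            raise ValueError("Message-ID and Date headers are required")
        gid = global_id(mid, date)
        if gid in self._by_global_id:
            raise DuplicateMessageError(gid)
        self._by_global_id[gid] = message
        self._by_message_id.setdefault(mid, []).append(message)
        return gid

    def get_message_by_global_id(self, key):
        # Returns None when nothing matches, as in the interface.
        return self._by_global_id.get(normalize_global_id(key))

    def get_messages_by_message_id(self, key):
        # A list rather than a set, to keep arrival order.
        return list(self._by_message_id.get(key, []))
```

Because a SHA1 digest is 20 bytes, the base32 form is always 32 characters with no padding, which matches the length of the example ID in the quoted session. The normalization is safe because the RFC 3548 base32 alphabet contains neither '0' nor '1'.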
>
> - -Barry
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (Darwin)
>
> iQCVAwUBRqCkWnEjvBPtnXfVAQIRhAP7BkuF5K0xOuie2GBqOOWDarksD5Oy49y9
> /2WO+u4xH+BttIt3adHJS+K6ETYcK79c5Rf4uwZk40DqWKK7ay1zkxUn/LGXOJ0o
> CoWQG5ZyFUJUTkDXtxEWcZ8kkXaDTTSNz2eCtYgQAXw77A95E1SjV0YBs54bFK3A
> Bi9cjrKRDcM=
> =pyY6
> -----END PGP SIGNATURE-----

From stephen at xemacs.org Tue Jul 24 08:56:35 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Tue, 24 Jul 2007 15:56:35 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To:
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
Message-ID: <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>

Jeff Breidenbach writes:

> > Notice that of 325146 total messages, 624 of them had no message-id
> > header. Even if you aggregate dup+col, you're still looking at a
> > total duplicate rate of 0.29%.
>
> Message ID's are supposed to be unique.

Fortunately, a rule more honored in the observance than the breach. Nonetheless, it *is* breached. The Postel Principle applies here, IMO.

> better to go ahead and use the message-id, rather than concoct
> yet another "this time we mean it!" unique identifier.

That's not the point. We're not going to impose this on senders; that's what Message-ID is for, as you say.
If a sender won't provide a proper Message-ID, third parties who get a CC are just out of luck. I simply think we should be prepared for applications where relying on the sender to supply a UUID is not acceptable; we need to be able to provide one ourselves. Creating UUIDs is a solved problem, after all. So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL. Then we say that an archive SHOULD provide access to the resource via Message-ID if available, and define how to construct that URL from the List-Archive and Message-ID headers. > Which brings me to suggestion #2, which is go ahead and write > an RFC on how list servers should embed archival links in messages. I think Barry already suggested that? Anyway, +1. But remember, a standards-track RFC should have a working implementation to point to. From jam at jamux.com Tue Jul 24 13:10:55 2007 From: jam at jamux.com (John A. Martin) Date: Tue, 24 Jul 2007 07:10:55 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> (Stephen J. Turnbull's message of "Tue, 24 Jul 2007 15:56:35 +0900") References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <877ioqys3k.fsf@athene.jamux.com> A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 154 bytes Desc: not available Url : http://mail.python.org/pipermail/mailman-developers/attachments/20070724/a44ea72c/attachment.pgp From stephen at xemacs.org Tue Jul 24 13:55:26 2007 From: stephen at xemacs.org (Stephen J. 
Turnbull) Date: Tue, 24 Jul 2007 20:55:26 +0900 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <877ioqys3k.fsf@athene.jamux.com> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> Message-ID: <87tzrukocx.fsf@uwakimon.sk.tsukuba.ac.jp> John A. Martin writes: > >> better to go ahead and use the mesage-id, rather than concoct > >> yet another "this time we mean it!" unique identifier. > > st> That's not the point. We're not going to impose this on > st> senders; > > I read the quote as meaning "this time we mean it really is unique", > imposing nothing on senders. Ah. If so, my reply is "if you want something done right, do it yourself." *All robust databases assign a unique ID to each record.* Why shouldn't a mailing list archive do so? > Right. Maybe that will encourage compliance. The complexity of > catering to brokenness in this instance may be too high a price to > impose on the all. What complexity? Mailman just does msg['X-List-Archive-Received-ID'] = Email.msgid() (or however the message ID generator is spelled). After that, it's up to the archiver whether to do anything with it or not. I proposed a way that it could be used; if that's considered too complex, fine. But simply assigning one is not complex or otherwise very costly. 
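As a rough illustration of how small that assignment step could be, here is a sketch using the standard library's `email.utils.make_msgid` as the ID generator. The function name `stamp_archive_id` and the `domain` argument are hypothetical; the header name is the one floated in the message above, and nothing here is Mailman's actual code.

```python
from email.message import Message
from email.utils import make_msgid

ARCHIVE_ID_HEADER = "X-List-Archive-Received-ID"


def stamp_archive_id(msg, domain=None):
    """Give `msg` a list-assigned unique ID unless it already has one.

    Hypothetical helper: the list server, not the sender, generates the
    value, so it does not depend on the sender's Message-ID being sane.
    """
    if msg[ARCHIVE_ID_HEADER] is None:
        msg[ARCHIVE_ID_HEADER] = make_msgid(domain=domain)
    return msg[ARCHIVE_ID_HEADER]
```

Whether an archiver then indexes on this header, on Message-ID, or on both is a separate decision; the sketch only covers the stamping.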
From jeff at jab.org Tue Jul 24 18:31:55 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Tue, 24 Jul 2007 09:31:55 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <877ioqys3k.fsf@athene.jamux.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com>
Message-ID:

There are three different parties coming to the table. One is the mail transfer agent of the sender, another is the list server, and the third is the archive server. Ideally, all three will be happy campers.

>So we just specify a header to put it in, and subscribers will be able
>to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the "canonical" URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

>What complexity? Mailman just does
>
> msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing. It's easier for the archive server to keep track of one namespace than two, and - most importantly - conceptually simpler.

From the perspective of the assorted list servers, it's easier to do nothing than to do something. So if they can get by with just message-id (which is already implemented) and not have to add x-list-archive-received-id, that's a smoother implementation path.

If we base this on message-id, archival servers will be able to retroactively add support for all their stored messages, even those that are ten years old. And users holding an old message will be able to figure out that URL without doing any computational gymnastics.
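For what a message-id based archive URL might look like in practice, here is a small sketch. The path layout (List-Archive base followed by the bare Message-ID) is only an assumption borrowed from the URL examples quoted earlier in the thread, and `message_id_url` is a hypothetical name; a real archive server would define its own scheme.

```python
from urllib.parse import quote


def message_id_url(list_archive, message_id):
    """Build an archive URL from List-Archive and Message-ID header values.

    Hypothetical layout: <List-Archive base>/<bare Message-ID>.
    """
    # Both headers are commonly wrapped in angle brackets; strip them.
    base = list_archive.strip().strip("<>").rstrip("/")
    bare = message_id.strip().strip("<>")
    # Percent-encode anything unsafe in a path segment, but keep '@' readable.
    return base + "/" + quote(bare, safe="@")
```

The quoting matters because nothing stops a Message-ID from containing characters such as '/' or spaces that would otherwise change the meaning of the URL.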
Put another way, there's the possibility to reduce the archive servers' implementation to "search for this message-id" which is something really useful to have anyway, and therefore likely to get wider support.

In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt; e.g. coming from 100 years in the future. Very painful if the archive is showing most recent message first. Therefore an archival server is very likely to determine message date from the most recent received header (generally from a trusted mail transfer agent) rather than the date field. From the archive server's perspective, the best thing to do with the date field is throw it away.

So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introducing another identifier. If the list server receives a message with no message-id, by all means create one on the spot. To me, this feels like the sweet spot in terms of cost benefit. The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.

Jeff

From Dale at Newfield.org Tue Jul 24 18:43:29 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Tue, 24 Jul 2007 12:43:29 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To:
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com>
Message-ID: <46A62C31.5010501@Newfield.org>

Jeff Breidenbach wrote:
> In addition, Barry was talking about concocting a unique
> identifier from the Date field and Message-ID.
I'm not a big fan of > this idea, because the date field comes from the mail user agent > and is often wildly corrupt; e;g; coming from 100 years in the future. Oh--I was assuming the Date to which he was referring was the current timestamp at which mailman was processing the message. I was going to say that this guarantees uniqueness, but I guess there are parallel mailman implementations where more than one machine/processor are all serving the same list, and then two different machines/processors might wind up with identical timestamps while processing two different messages. -Dale From terri at zone12.com Tue Jul 24 19:11:17 2007 From: terri at zone12.com (Terri Oda) Date: Tue, 24 Jul 2007 13:11:17 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> Message-ID: <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com> On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote: >> So we just specify a header to put it in, and subscribers will be >> able >> to use it, per definition of a canonical URL. > It is the archive server's job to decide what is the "canonical" URL > for a message. There's a good chance these archival URLs will be > served by an HTTP redirect. So let's not use the word canonical. :) Someone already pointed out that the message ID is a bit long for a URL, so I'm guessing we're going to want some sort of shorter sequence number for messages for linking purposes. Regardless of whether we *need* to generate our own unique ID, I'm leaning towards the thought that we're going to *want* to generate our own for usability reasons. 
In a perfect world, I think we'd have a sequence number so I could visit http://example.com/mailman/archives/listname/204.html and know that 205.html would be the next message to that list, but any short unique id would do if sequence numbers are too much of a pain.

It seems silly to generate nice short links but then use message-id. If we can generate nice short links, we might as well use 'em throughout, unless you really think the default use of the archive will be to search it by messageid (which I sincerely doubt, from my user experiences).

Terri

From jeff at jab.org Tue Jul 24 20:03:09 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Tue, 24 Jul 2007 11:03:09 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
Message-ID:

> Regardless of whether we *need* to generate our own unique ID, I'm
> leaning towards the thought that we're going to *want* to generate
> our own for usability reasons. In a perfect world, I think we'd have
> a sequence number so I could visit http://example.com/mailman/
> archives/listname/204.html and know that 205.html would be the next
> message to that list, but any short unique id would do if sequence
> numbers are too much of a pain.

I agree there are a lot of usability benefits from short URLs, but perhaps this is the job of the archive server, and not the list server. Mharc (an archive server) is a great example here. Mharc's canonical message format is pretty human friendly.
http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg00000.html

Unfortunately, there's no trivial way for the list server to know that human friendly URL when the message is sent out. Fortunately, Mharc also happily handles messages by message-id, which the list server does know about.

http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users&i=200208010532.g715W0e31774 at gator.earlhood.com

Had I been the implementer, I'd probably have made mharc do an HTTP 302 redirect from the longer URL to the shorter URL. But that's beside the point. The point is we have an existing, working, happy archival server, and it would be really nice if list servers (such as mailman) were compatible. And by compatible, I mean offering the capability of embedding an archival URL in the footers of messages.

-Jeff

From barry at python.org Tue Jul 24 20:41:38 2007
From: barry at python.org (Barry Warsaw)
Date: Tue, 24 Jul 2007 14:41:38 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <1183651769.10813.6.camel@finch.boston.redhat.com> <74B7C2E0-C11A-4A36-8DA0-70219A026E3B@zone12.com>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:

> On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:
>> I've looked at a few lurker archivers and I wasn't blown away by its
>> user interface. That's apparently highly configurable though.
>
> I've been doing a lot of thinking about interface, and I'm coming to
> the conclusion that something more like a web bulletin board is
> probably the way to go, given that people use them all the time
> without much trouble and with a fairly minimal amount of whining. ;)

I like this for several reasons.
I've long wanted a bridge between the traditional mailing list and a forum because to me they're related along a spectrum of emotional investment. What I mean is this. For the subjects and projects I care deeply about, I join the mailing list. I want to be intimately involved in the day-to-day collaboration that being subscribed gives me. I care enough about that that I'm willing to put up with the pain that comes along with mailing lists, such as the overhead for subscribing, deleting topics I don't care about, the occasional spam, the overhead of going on vacation or leaving the list, etc. But there are even more topics or projects that I have only a fleeting interest in. Say I find a bug in some X program, or wake up and decide to learn how to use setuptools, or find that some recent update broke my Linux server. In all those cases, I might want to start a thread of discussion or ask a question, and be very involved in that thread for a week or two. Then, my interest wanes, or I get my question answered, or other projects pique my interest. Mailing lists are pretty bad at managing those kinds of fleeting involvement, but forums are quite nice. There's usually fairly low overhead (and probably even less if OpenID and such were in widespread adoption) for joining, and when I lose interest the forum doesn't fill up my inbox. OTOH, forums seem good for short 'instant' messages, but not so good (IMO) for free ranging, detailed discussions. So there's a spectrum. > I'm trying to use interfaces to things like comment systems (which > are often threaded -- picture the slashdot stuff, maybe?) and popular > boards like phpbb (which isn't threaded beyond separate topics) as > guides to how people usually deal with conversations on the web. 
> > It'd actually be fairly easy, at that point, to just put a posting > interface into the archives (yes, you'd have to be logged in, and > yes, this means your password becomes that bit more valuable because > someone having it can pose as you to the list... but they could do > that by spoofing your email address so I'm not too concerned). But > then people who don't like email or just want to pop by and check the > list quickly could actually use mailman like a web board, which is > something I'm pretty sure would get used (I know my users have asked > for it in the past). Heck, /I'd/ use it, so what more justification do we need? :) > I've been drafting simple prototype interfaces in my head, trying to > keep potential architectures in mind. I'm hoping I'll have time this > week to code some up HTML and see how well they actually work when > they're not just inside my head. :) I'd love to see the prototypes once you've committed them to HTML. The one important thing is that the individual postings will need the equivalent of a stable archive URL (i.e. permlink) that could be passed around, added to web pages, etc. 
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqZH43EjvBPtnXfVAQLrzQP8CG5ALhX+Wk91I+jri20R60C7cqtCzQby V9MD8FlhC/7LbRW3QXwJnwWSpXCnBYhShxmRMn2maEeIXqPUEBl3QOcUYkHxeRZG zV6sKE1J1EZfbUTY7CM3lcnOZKHB1n07PGslcxQsJHEmnbuHbR7bm+2AV2CknzZj 8Y/9XxPjX5Q= =IRq2 -----END PGP SIGNATURE----- From barry at python.org Tue Jul 24 20:44:27 2007 From: barry at python.org (Barry Warsaw) Date: Tue, 24 Jul 2007 14:44:27 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote: > Which brings me to suggestion #2, which is go ahead and write > an RFC on how list servers should embed archival links in messages. > This sounds like an internet wide interoperability issue as much as > something mailman specific. Why not come up with a scheme usable > by all list servers? And also describe a specification third party > archival > services can comply to. Besides, I've always wanted to help write > an RFC. If we go that route, it would be good to get input from a > range > of people - one person I'd suggest is Earl Hood, author of mhonarc. I've always thought that an RFC-like spec that describes how a generic mailing list manager would interoperate with a generic archiving service is the way to go. I've written up a somewhat more formal spec of what I've currently implemented in MM3 here: http://wiki.list.org/display/DEV/Stable+URLs If this looks good, I'd be happy to approach some of the related communities to try to get buy-in.
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqZIjHEjvBPtnXfVAQLK9AP/VQveYtFuZhJam9TITYBuMyc8pig7nqDt efn4DIXhZhgtqBQ58/TgEFZnTkKfiZ1HLdoovrQye8HdKZmuAd+SJrOkq/aO9fIC ZgaV5HYBD7TcnQuO2z5eRuK3IY7FpWoeZrn/a6sxBObsaSOrOTjhqs1gv5go24d3 8CmG/bB9LTo= =EyoU -----END PGP SIGNATURE----- From barry at python.org Tue Jul 24 20:53:25 2007 From: barry at python.org (Barry Warsaw) Date: Tue, 24 Jul 2007 14:53:25 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <419AFBBF-82FF-4939-B85B-85055A1B8482@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote: > I simply think we should be prepared for applications where relying on > the sender to supply a UUID is not acceptable; we need to be able to > provide one ourselves. Creating UUIDs is a solved problem, after all. > So we just specify a header to put it in, and subscribers will be able > to use it, per definition of a canonical URL. > > Then we say that an archive SHOULD provide access to the resource via > Message-ID if available, and define how to construct that URL from the > List-Archive and Message-ID headers. I think there's two approaches we could argue for. One is for the mailing list manager to craft a UUID out of whole cloth and stick that in a header. Then any downstream archiver would be obliged to use that header value as the canonical address of the message, with an alternative path to the message via the Message-ID (possibly returning a list of matching messages when there are collisions). 
The second approach, and the one that I favor, is to use the Message-ID (and the Date) headers on the original message as the UUID, properly handling corner cases like duplicate or missing headers. This UUID serves as the basis for the address to the message resource just like above. I like the second approach better because in the case where you start with an off-list copy of the message, you have a decent enough chance of getting to the archived message, or at least to a resource containing a link to the message. The first alternative would require access to the list copy. Imagine if every archiver supported my proposal: knowing just the Message-ID and Date headers, you could get to that message from almost anywhere, just by using the UUID as a relative URL rooted at say http://www.mail-archive.com, http://groups.google.com, http://mail.python.org/pipermail, or whatever. That would be pretty neat. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqZKpnEjvBPtnXfVAQJWcwP6A6SqHTeft+c/5IeSpRsI+gvtPJW94fcG pjB66oYiKco7U+rZtxll3TPD9Ta7gccohq72sh8hV7CHRW7Cd531Hq91z7QktHUW zqzxkMimoca7WlUxr0/ElyPNhRkjMlR8LvhNCjs4a9O6/PpzBTNjsXwaTKfLrqO3 N5iq3BWoMK8= =fSNC -----END PGP SIGNATURE----- From barry at python.org Tue Jul 24 21:11:27 2007 From: barry at python.org (Barry Warsaw) Date: Tue, 24 Jul 2007 15:11:27 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote: >> What complexity? Mailman just does >> >> msg['X-List-Archive-Received-ID'] = Email.msgid() > > Easy to introduce, harder to deal with.
The archival server would now > keep track of both the message-id and the x-list-archive-received-id. > That's two namespaces that almost do the same thing. It's easier > for the archive server to keep track of one name space than two, > and - most importantly - conceptually simpler. True, but an archiver already has to handle collisions on the Message-ID so in a sense, you have to maintain multiple paths to the same message, don't you? So I like my proposal because it imposes nothing additional on the MUA or MTA, a tiny bit more on the MLM, and some extra work (though I think not much) on the archiving agent. What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy, and human friendlier urls. > From the perspective of the assorted list servers, it's easier to > do nothing than to do something. So if they can get by with > just message-id (which is already implemented) not have to add > x-list-archive-received-id, that's a smoother implementation path. > If we base on message-id, archival servers will be able to > retroactively add support for all their stored messages, even those > that are ten years old. And users holding an old message will be > able to figure out that URL without doing any computational > gymnastics. All of these are still true with my proposal, with the caveat, as Stephen points out, that given a URL based on sender-provided headers, you must be prepared to deal with collisions, so sometimes your resources will return lists. The advantage of adding a bit of MLM-provided information is that given the list copy you can guarantee uniqueness, and given the off-list copy you can get to a resource that contains a link to the message you want. > Put another way, there's the possibility to reduce the archive > servers' implementation to "search for this message-id" which is > something really useful to have anyway, and therefore likely to > get wider support.
> > In addition, Barry was talking about concocting a unique > identifier from the Date field and Message-ID. I'm not a big fan of > this idea, because the date field comes from the mail user agent > and is often wildly corrupt; e;g; coming from 100 years in the future. > Very painful if the archive is showing most recent message first. > Therefore an archival server is very likely to determine message date > from the most recent received header (generally from a trusted mail > transfer agent) rather than the date field. From the archive server's > perspective, the best thing to do with the date field is throw it > away. Throw it away or hide it? The former would be a problem, but not the latter. Does your archiver keep a canonical copy of the message as you received it? If so, then you preserve the original Date header enough for the calculation to occur, even if you hide the Date header, or display a Received header date when you render it to HTML. That doesn't matter of course. But I should point out that I'm not married to including the Date header in the hash. I like it because it appears to reduce collisions which I care about. But I still like using the base32 sha1 hash instead of the raw Message-ID because I think it's easier for humans to use, read, speak, and copy. Of course this doesn't mean that you need to disable your search-by-Message-ID feature! > So for these reasons, I'd rather stick with message-id and risk > some real world collisions, instead of introduce another identifier. > If the list server receives a message with no message-id, by all means > create one on the spot. To me, this feels like the sweet spot in > terms > of cost benefit. The main thing that bugs me is message-ids are long, > which makes them awkward to embed in a URL in the footer of a > message. Another advantage for the URL scheme I propose. 
You know you're going to end up with URLs of length len(host-prefix) + 32 + 1 + #digits-in-seqno, where 32 == len(base32(sha1digest(data))), 1 == the / divider, and #digits-in-seqno == e.g. len(str(seqno)). You should be able to keep things in the 60-70 character range, including the host name. That doesn't seem too bad. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqZO4HEjvBPtnXfVAQIYGwP/VZPCiQrg9CTeMThApNTh7xUismbW0AiT 1N6a8DusXDBrqiLDQd+v2/R5KOV+TnwDNlIcl5FfFatHxWJ0bGy850kT/nhrHdKU UrW0hR8PWSMIRN5Bqx9bL9cvaMigAoyX+njAfiDgl0yy7arbAm66GH1HNH3c1XGT 1/qaGckINUg= =4uwH -----END PGP SIGNATURE----- From turnbull at sk.tsukuba.ac.jp Wed Jul 25 05:04:23 2007 From: turnbull at sk.tsukuba.ac.jp (Stephen J. Turnbull) Date: Wed, 25 Jul 2007 12:04:23 +0900 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> Message-ID: <87hcntkwug.fsf@uwakimon.sk.tsukuba.ac.jp> Jeff Breidenbach writes: > >So we just specify a header to put it in, and subscribers will be able > >to use it, per definition of a canonical URL. > > It is the archive server's job to decide what is the "canonical" URL > for a message. There's a good chance these archival URLs will be > served by an HTTP redirect. So let's not use the word canonical. :) If it's not going to be "canonical" (I forget if there's a standard for that word :), what is the point in writing an RFC? > >What complexity? Mailman just does > > > > msg['X-List-Archive-Received-ID'] = Email.msgid() > > Easy to introduce, harder to deal with. The archival server would now > keep track of both the message-id and the x-list-archive-received-id. > That's two namespaces that almost do the same thing.
The implementations are similar, and there is "nearly" a one-to-one correspondence. But the semantics are very different. Message-ID is untrustworthy, the internal ID is trustworthy. > So for these reasons, I'd rather stick with message-id and risk > some real world collisions, instead of introduce another identifier. Go ahead and stick with message-id if *you* like, but please don't tell *me* what risks I have to accept. There needs to be a way to *enforce* uniqueness, and it *must* be specified by the RFC in order for archive implementations to be interoperable. Note that word "specify"; I do not insist that this level of robustness be *required*. But if we don't specify it now, people who want such robustness will have to do all this work again, and possibly will end up with something that some servers conforming to "your" RFC will not conform to. It is possible that most archivers will simply use the message ID, and do something brutal in the rare case of a collision. That's fine. But an archiver that wants to provide a canonical URL which is guaranteed to uniquely and losslessly identify a post in its archive should have a standard way to do that. > The main thing that bugs me is message-ids are long, which makes > them awkward to embed in a URL in the footer of a message. The footer URL is of no concern in this discussion. There is not going to be a requirement that footer URLs be "canonical", not if I have any say in the matter. The "canonical" URL will be in (or be constructed from) the message header. 
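For illustration only, the one-liner quoted above might look roughly like this with Python's standard email package. The header name X-List-Archive-Received-ID is the strawman from this thread, not an agreed standard, and stamp_archive_id and its domain argument are made-up names. The sketch assumes the list server discards any value the sender supplied, since only an ID assigned by the MLM itself can be trusted.

```python
from email.message import Message
from email.utils import make_msgid

# Strawman header name taken from this thread; not a standard.
HEADER = 'X-List-Archive-Received-ID'


def stamp_archive_id(msg, domain='lists.example.com'):
    """Attach a list-assigned unique ID to msg and return it.

    Any incoming copy of the header is dropped first, because a value
    supplied by the sender is no more trustworthy than Message-ID.
    """
    del msg[HEADER]  # no-op when the header is absent
    msg[HEADER] = make_msgid(domain=domain)
    return msg[HEADER]
```

The sender's Message-ID is left untouched, so an archiver can still offer lookup by Message-ID alongside the list-assigned ID.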
From jeff at jab.org Wed Jul 25 06:47:23 2007 From: jeff at jab.org (Jeff Breidenbach) Date: Tue, 24 Jul 2007 21:47:23 -0700 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> Message-ID: > What you gain from my proposal over a pure Message-ID approach > is guaranteed uniqueness given the list copy Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies. Sometimes the channel between the MLM and the archive server will be SMTP, and spurious messages can be injected. Finally, from the archive server's perspective, some of the MLMs might make mistakes - just like from the MLM's perspective, some of the MTAs might make mistakes in setting message-id. So I don't think the proposed SHA1(date, message-id) scheme buys a hard guarantee of uniqueness. Every component has to protect itself, but none can solve the world's problems. So that moves us to how much collisions are reduced in practice. I have a question about the numbers Barry mined from the python lists. Are the collisions really that high? One should not count messages without a message-id, because the MLM can and should create one in that case. One should also not count collisions of messages going to different lists. Here's why. Let's say message M is cross posted to lists L1 and L2. Even though it is the same message, there are now two different contexts. (For example, people visiting M at archive L1 should get a completely different experience when they hit "next message" than people visiting M at archive L2.) So I'd be curious what the collision numbers come to with these two factors taken into account.
The other takeaway is list name really should be part of the URL to get proper context. The earlier example from Mharc does this. > and human friendlier urls. That's a very compelling point. SHA1 can't be computed inside someone's head or simple cut-n-pasted together for old messages, but I think the usability benefits of short URLs (short enough that they can comfortably fit inside message bodies) outweighs this drawback. By the way, is SHA-1 still in favor? My impression was it was fading away after the Shandong University team partially cracked it. > Throw it away or hide [Date]? The former would be a problem, > but not the latter. Thrown away. My favorite archival service is based on mhonarc, and raw mail goes into offline cold storage. Of course this can be changed for the future messages with some pain, but there's no reasonable way for myself (or any other mhonarc users in the same predicament) to retrofit against Date based URLs. For the record, here's what mhonarc embeds in each HTML page it produces because these were considered the important headers. In this message sent from Australia, the date shows a timezone of UTC -0700, because it was pulled from the received header. So my main request is to double check the numbers, see if using "Date" really buys as much as one thinks. I'll keep digesting the other aspects of the wiki page. 
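A rough sketch of the recount requested above, with both exclusions applied: messages lacking a Message-ID are skipped, and collisions are counted only within a single list. This is not Barry's scan.py, whose contents are not reproduced in this thread. count_collisions and its use_date flag are hypothetical names, and comparing the raw Date header string is an assumption.

```python
from collections import Counter
from email import message_from_string


def count_collisions(messages, use_date=False):
    """Count surplus messages sharing an identifier within one list.

    messages is an iterable of (list_name, email.message.Message) pairs.
    """
    seen = Counter()
    for list_name, msg in messages:
        message_id = msg.get('Message-ID')
        if not message_id:
            # The MLM can and should create one, so these are not counted.
            continue
        key = (list_name, message_id.strip())
        if use_date:
            # Optionally add the raw Date header to the identifier.
            key += ((msg.get('Date') or '').strip(),)
        seen[key] += 1
    # Every message beyond the first with the same key is a collision.
    return sum(n - 1 for n in seen.values() if n > 1)
```

To run it over a real archive, the pairs could come from iterating a mailbox.mbox object per list. Comparing the result with use_date=False and use_date=True would show how much the Date header actually buys.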
From barry at python.org Wed Jul 25 15:06:32 2007 From: barry at python.org (Barry Warsaw) Date: Wed, 25 Jul 2007 09:06:32 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 24, 2007, at 1:11 PM, Terri Oda wrote: > On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote: >>> So we just specify a header to put it in, and subscribers will be >>> able >>> to use it, per definition of a canonical URL. >> It is the archive server's job to decide what is the "canonical" URL >> for a message. There's a good chance these archival URLs will be >> served by an HTTP redirect. So let's not use the word canonical. :) > > Someone already pointed out that the message ID is a bit long for a > URL, so I'm guessing we're going to want some sort of shorter > sequence number for messages for linking purposes. Yes, definitely. What do you think of the base32 examples I have on the wiki page? > Regardless of whether we *need* to generate our own unique ID, I'm > leaning towards the thought that we're going to *want* to generate > our own for usability reasons. In a perfect world, i think we'd have > a sequence number so I could visit http://example.com/mailman/ > archives/listname/204.html and know that 205.html would be the next > message to that list, but any short unique id would do if sequence > numbers are too much of a pain. > > It seems silly to generate nice short links but then use message-id. 
> If we can generate nice short links, we might as well use 'em > throughout, unless you really think the default use of the archive > will be to search it by messageid (which I sincerely doubt, from my > user experiences). We'd want sequence numbers in the urls if we think people will hand edit them, say in a browser location bar. I'm not sure that's a common enough use case. Pipermail currently uses sequence numbers but there are big problems with that. First, the mbox'ing algorithm wasn't always correct so while sequence numbers were accurate when generating the html archives on the fly, they broke horribly when you try to regenerate them from an mbox file. It's also why we have tools like cleanarch which tries to unbreak earlier mboxing bugs by crufty heuristics. This /might/ be solved by ditching mboxes for maildir or some other canonical raw archiving format (not a bad idea in its own right), but manual surgery on the raw archives could still break it. Sometimes site admins just /have/ to remove messages, disrupting the sequencing. 
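A toy illustration of the fragility described above, assuming page names are derived purely from a message's position in the raw archive. This is a simplification for the sake of the example, not Pipermail's actual code, and positional_urls is a made-up name.

```python
def positional_urls(message_ids):
    """Map each message to a page name based on its position in the archive."""
    return {mid: '%d.html' % n for n, mid in enumerate(message_ids)}


# Archive as first generated, then regenerated after an admin removes a message.
before = positional_urls(['<a@x>', '<b@x>', '<c@x>'])
after = positional_urls(['<a@x>', '<c@x>'])

# Messages whose page name changed, i.e. whose old links now point elsewhere.
moved = sorted(mid for mid in after if before[mid] != after[mid])
```

Here every message after the removed one ends up in moved, which is the link breakage that an identifier derived from the message itself would avoid.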
- -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqdK2XEjvBPtnXfVAQKfDQP/ToPZ3t7+uIyMrsThOr+PVQ7aKVT/BQ7F OgKqFSDSma4ZofQOkPgr4ZFRT1yKRURWas7jI2zQ8ADPAOKCYh0Udgq6XjpOI8mI 7/pODazVkbwzT9Oo06pGwpzaONK4eZjt1y9IDb9VkniUcAyve5EQ+5+KaG3rbo4M wsrCnHLkvSE= =/z/f -----END PGP SIGNATURE----- From barry at python.org Wed Jul 25 15:10:37 2007 From: barry at python.org (Barry Warsaw) Date: Wed, 25 Jul 2007 09:10:37 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com> Message-ID: <128BEC04-DFA5-4D9E-B813-8091FE3DEE94@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote: >> Regardless of whether we *need* to generate our own unique ID, I'm >> leaning towards the thought that we're going to *want* to generate >> our own for usability reasons. In a perfect world, i think we'd have >> a sequence number so I could visit http://example.com/mailman/ >> archives/listname/204.html and know that 205.html would be the next >> message to that list, but any short unique id would do if sequence >> numbers are too much of a pain. > > I agree there's a lot of usability benefits from short URLs, but > perhaps > this is the job of the archive server, and not the list server. > Mharc (an > archive server) is a great example here. Mharc's canonical message > format is pretty human friendly. > > http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg00000.html > > Unfortunately, there's no trivial way for the list server to know > that human > friendly URL when the message is sent out. 
Fortunately, Mharc is also > happy handles messages by message-id, which the list server does know > about. > > http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc- > users&i=200208010532.g715W0e31774 at gator.earlhood.com > > Had I been the implementer, I'd probably have made mharc do an HTTP > 302 > redirect from the longer URL to the shorter URL. But that's besides > the point. > The point is we have an existing, working, happy archival server, > and it would > be really nice if list servers (such as mailman) were compatible. > And by > compatible, I mean offering the capability of embedding an archival > URL in the > footers of messages. I agree, I just don't think message-ids are user friendly enough to be this canonical url. Especially in this context, which is exactly where urls are thrown in users faces. An archiving service is exactly the right place for redirecting human readable urls to the archiver's canonical url (by, I agree, 302). - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqdLznEjvBPtnXfVAQJtxgQAiLp7TjnLoOLnpoxfli2gBo6fdU6ZIFb0 SKiuRgLAoTSdnJymYWOww2U/vTJ3HqR2dZNFCfGeVHgzoHpiX87WiZDJ4Sx1Jec8 7BpIO1ZokGI2NhHiSscYC5k4iCzce17lVGkyVzfYlFysmFKsFjcDIpV8wQFleeG9 TneLaMXT2eY= =1tKI -----END PGP SIGNATURE----- From barry at python.org Wed Jul 25 15:17:13 2007 From: barry at python.org (Barry Warsaw) Date: Wed, 25 Jul 2007 09:17:13 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: <87hcntkwug.fsf@uwakimon.sk.tsukuba.ac.jp> References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> <87hcntkwug.fsf@uwakimon.sk.tsukuba.ac.jp> Message-ID: <8530573B-F935-43A8-AD22-CAEC776807D4@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 24, 2007, at 11:04 PM, Stephen J. 
Turnbull wrote: >>> So we just specify a header to put it in, and subscribers will be >>> able >>> to use it, per definition of a canonical URL. >> >> It is the archive server's job to decide what is the "canonical" URL >> for a message. There's a good chance these archival URLs will be >> served by an HTTP redirect. So let's not use the word canonical. :) > > If it's not going to be "canonical" (I forget if there's a standard > for that word :), what is the point in writing an RFC? I completely agree. Maybe "interoperable" is the right word to use. Or "user friendly interoperable archive url" which is really what we're trying to define here (IMO). > There needs to be a way to *enforce* uniqueness, and it *must* be > specified by the RFC in order for archive implementations to be > interoperable. Note that word "specify"; I do not insist that this > level of robustness be *required*. But if we don't specify it now, > people who want such robustness will have to do all this work again, > and possibly will end up with something that some servers conforming > to "your" RFC will not conform to. Yep. > It is possible that most archivers will simply use the message ID, and > do something brutal in the rare case of a collision. That's fine. > But an archiver that wants to provide a canonical URL which is > guaranteed to uniquely and losslessly identify a post in its archive > should have a standard way to do that. Yep. >> The main thing that bugs me is message-ids are long, which makes >> them awkward to embed in a URL in the footer of a message. > > The footer URL is of no concern in this discussion. There is not > going to be a requirement that footer URLs be "canonical", not if I > have any say in the matter. The "canonical" URL will be in (or be > constructed from) the message header. Agreed in the sense that the RFC 2822 headers must contain all the information necessary to construct the canonical url (or must contain the canonical url). 
A list server /can/ decorate the message with the url in other ways, but that certainly isn't necessary. You might even imagine a mail reader extension that read the appropriate List-* headers and added a button "View In Archive" which sent the canonical url to your web browser. Once that happens, the archive service is free to redirect to its heart's content. I submit though that any good archive service (and certainly Pipermail++ if I can help it) will ensure that those urls are stable forever, otherwise people will stop relying on them. - -Barry -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin) iQCVAwUBRqdNWnEjvBPtnXfVAQIZRAP/Ux9rUK6ToH5Zl2XTC8LOKgCG+1yhf4pw h4XVZc0nmP1xxFttsXzsuY+/oGFW8yrY0yGnxK4N5EKUEpIxejGNbVtAjpQ5l/Sy ml5R5kDhZtk/d8tE9IXOzB5zCcxdmMgjX3KfL78t5L6JzAQ4RgM0MTYxPH69AdHW zpvhBCow/z8= =KiqU -----END PGP SIGNATURE----- From barry at python.org Wed Jul 25 15:34:13 2007 From: barry at python.org (Barry Warsaw) Date: Wed, 25 Jul 2007 09:34:13 -0400 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> Message-ID: <2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote: >> What you gain from my proposal over a pure Message-ID approach >> is guaranteed uniqueness given the list copy > > Guarantee is a pretty strong word. A malicious person could post two > messages with the same message-id, same date, but different bodies. No question, if the archive service and the list server are not intimately connected, the communication channel between the two can be subverted.
There are ways that channel's trust could be enhanced though, for example by the list server signing its headers in a DKIM-like fashion. But in situations where the two are co-located, you can trust these headers even without that enhancement. > So that moves us to how many collisions are reduced in practice. > I have a question about the numbers Barry mined from the python > lists. Are the collisions really that high? One should not count > messages without a message-id, because the MLM can and should > create one in that case. I've uploaded the script I used here: http://wiki.list.org/download/attachments/786633/scan.py?version=1 It's probably not perfect, and certainly the python.org mboxes may not be representative enough of the real world. Please grab the script, tweak it and run it over your own raw archives; it should be easily modified to handle any of the mailbox formats supported by Python 2.5's mailbox module. If you improve the script or find numbers that lead to different conclusions, now's the time to know! >> and human friendlier urls. > > That's a very compelling point. > > SHA1 can't be computed inside someone's head or simple cut-n-pasted > together for old messages, but I think the usability benefits of > short > URLs (short enough that they can comfortably fit inside message > bodies) > outweighs this drawback. By the way, is SHA-1 still in favor? My > impression was it was fading away after the Shandong University team > partially cracked it. We're not concerned with the cryptographic security claims of SHA1. I don't see any economically beneficial attack on the archives against SHA1 here. I think SHA1 is reasonably universally available, and marginally better than MD5, so it's probably good enough for this application. You're right that no one is going to do SHA1 in their heads, and if they could, they're probably working for some TLA in a secret gubmit basement lab somewhere.
The point of course is that a /program/ could easily apply the algorithm to a very minimal existing message and come up with the same canonical url. This enables all kinds of cool applications based on REST-y principles or whatever. The fact that the algorithm leads to short(ish), largely unambiguous (to humans), readable urls is an important benefit -- probably /the/ most important benefit. >> Throw it away or hide [Date]? The former would be a problem, >> but not the latter. > > Thrown away. Really? Wow. I'd have thought every archiving service would want to keep a record of the raw message it received on the wire. That would allow it to regenerate the html archive if necessary, provide useful forensics, and allow for exactly the kind of data mining we're doing here. I can't see /any/ reason for not saving the raw messages in their entirety, especially for a public list. Maybe for a private one, where your data retention policies require you delete things after a certain amount of time, but even there, I can't see why you'd want to trim raw messages rather than just chucking them entirely. > My favorite archival service is based on mhonarc, > and raw mail goes into offline cold storage. What's the advantage of that? Isn't disk space cheap as dirt? Probably cheaper if you've bought any topsoil recently :). Still, the raw messages are still available right? So if there was enough value in calculating the canonical urls so that the archive service could be seen as an interoperability good citizen, then it could be done. I'll just reiterate that I'm not married to including the Date header in the algorithm. Until proven otherwise by more research, I think it's a good idea to use because 1) it's required by RFC 2822 and 2) it seems to reduce collisions. I think the algorithm I propose would work just as well with Message-IDs alone, although there's more of a chance that the non-sequence numbered url will return multiple matches. 
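A minimal sketch of how a program could derive the hash-based identifier discussed above from nothing more than a message's headers. The thread fixes only the ingredients (Message-ID, optionally Date, SHA-1, base32); how the two header values are joined and normalized here is an assumption, the wiki spec is not reproduced in this thread, and stable_id and archive_url are hypothetical names.

```python
import base64
import hashlib
from email import message_from_string


def stable_id(msg, use_date=True):
    """Return a 32-character base32 SHA-1 digest of Message-ID (and Date)."""
    parts = [(msg.get('Message-ID') or '').strip()]
    if use_date:
        parts.append((msg.get('Date') or '').strip())
    # Joining with a newline is an assumption made for this sketch.
    digest = hashlib.sha1('\n'.join(parts).encode('utf-8')).digest()
    # SHA-1 yields 20 bytes, which base32 encodes as exactly 32 characters.
    return base64.b32encode(digest).decode('ascii')


def archive_url(host_prefix, msg, seqno=None):
    """Build host-prefix/hash, plus /seqno when the list copy supplies one."""
    url = host_prefix.rstrip('/') + '/' + stable_id(msg)
    if seqno is not None:
        url += '/' + str(seqno)
    return url
```

Because only sender-provided headers go into the hash, an off-list copy and the list copy of the same message produce the same identifier. The optional sequence number is the piece that only the list copy can supply.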
- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdRVnEjvBPtnXfVAQJiOgP/UIufdisvgVPV3qKo4dV2bfWoUPcp/dIQ
iGj9faWXFwa/NoOk3HtIZbu7JVrJEY2t9nihJX6lEjZ1Q6AFH1hkObx0dV5NRfj2
KjRANxU6UsBvpDCzBQWthX1d7HviRJ74Pio5hVti+0YoV4pjq8UHaxTlrECHmkad
ERlOYR2onAQ=
=8b8I
-----END PGP SIGNATURE-----

From jfesler at gigo.com Wed Jul 25 16:29:02 2007
From: jfesler at gigo.com (Jason Fesler)
Date: Wed, 25 Jul 2007 07:29:02 -0700 (PDT)
Subject: [Mailman-Developers] Improving the archives
In-Reply-To:
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
    <849198AE-DEC3-44C8-A090-470720624185@python.org>
    <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
    <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
    <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
    <877ioqys3k.fsf@athene.jamux.com>
Message-ID:

> Guarantee is a pretty strong word. A malicious person could post two
> messages with the same message-id, same date, but different bodies.

This is my concern too. Especially since this is known information; it is trivial to be malicious. Whatever was done, I think would *have* to deal with 'dupes', in some form or another.

From gustav at gcis.gov.za Wed Jul 25 11:30:03 2007
From: gustav at gcis.gov.za (Gustav H Meyer)
Date: Wed, 25 Jul 2007 11:30:03 +0200
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <46A62C31.5010501@Newfield.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
    <849198AE-DEC3-44C8-A090-470720624185@python.org>
    <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
    <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
    <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
    <877ioqys3k.fsf@athene.jamux.com>
    <46A62C31.5010501@Newfield.org>
Message-ID: <46A7181B.60406@gcis.gov.za>

Hi,

I think this is the first time that I'm posting here but hopefully not the last. Thanks to everyone involved for an incredible project.
I'm not much of a developer but I like practical solutions and will do everything possible to help improve this area, even if it's just to give some feedback. I'm very excited about this project and can't wait for the next version to come out with full integration between web forum and mailing list. I like this idea very much and it seems that we're going to see it real soon. :)

On 24/07/2007 18:43, Dale Newfield wrote:
> Jeff Breidenbach wrote:
>> In addition, Barry was talking about concocting a unique
>> identifier from the Date field and Message-ID. I'm not a big fan of
>> this idea, because the date field comes from the mail user agent
>> and is often wildly corrupt; e.g. coming from 100 years in the future.
>
> Oh--I was assuming the Date to which he was referring was the current
> timestamp at which mailman was processing the message. I was going to
> say that this guarantees uniqueness, but I guess there are parallel
> mailman implementations where more than one machine/processor are all
> serving the same list, and then two different machines/processors might
> wind up with identical timestamps while processing two different messages.

I also like the idea of seeing the date somewhere in the URL, but IMHO we also need to see a unique sequential number. How about the following idea:

http://my.list.server/archivebase/mylist/200707240001/msg00001/
http://my.list.server/archivebase/mylist/200707250001/msg00002/
http://my.list.server/archivebase/mylist/200707250002/msg00003/

and at the same time allow the following:

http://my.list.server/archivebase/mylist/msg00001/
http://my.list.server/archivebase/mylist/msg00002/
http://my.list.server/archivebase/mylist/msg00003/

This way you can see exactly how many messages were sent on a day and how many messages have been sent since the start. BTW, in my view the sequential number does not have to be a decimal value.
Anything short and sweet will do as long as you can work it out and at the same time allow for almost unlimited growth.

Just an idea.

Regards,
Gustav H Meyer

From stephen at xemacs.org Wed Jul 25 18:40:08 2007
From: stephen at xemacs.org (Stephen J. Turnbull)
Date: Thu, 26 Jul 2007 01:40:08 +0900
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <128BEC04-DFA5-4D9E-B813-8091FE3DEE94@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
    <849198AE-DEC3-44C8-A090-470720624185@python.org>
    <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
    <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
    <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
    <877ioqys3k.fsf@athene.jamux.com>
    <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com>
    <128BEC04-DFA5-4D9E-B813-8091FE3DEE94@python.org>
Message-ID: <874pjsl9nb.fsf@uwakimon.sk.tsukuba.ac.jp>

Barry Warsaw writes:

> I agree, I just don't think message-ids are user friendly enough to
> be this canonical url. Especially in this context, which is exactly
> where urls are thrown in users' faces. An archiving service is
> exactly the right place for redirecting human readable urls to the
> archiver's canonical url (by, I agree, 302).

I'm confused (to be precise, you're confusing me). If human readable URLs are exactly right for redirection to the canonical URL, why does the canonical URL need to be user friendly?

A quick remark: the git SCM uses BASE16 SHA1s for object names, but allows you to abbreviate them to the unique prefix. A friendly archive could do the same for your BASE32 ids.

Without going much into implementation, here's how I would write the conformance section for our RFC. The point is that I don't see any need to discuss user-friendliness or the implementation of UUIDs for the RFC! This means that getting those right from the start is not that important.

0. Conformance

0.1 List managers

A conforming list manager MUST provide the List-Archive header field if the post is being archived.
A conforming list manager MAY provide the List-Archive-UUID header field. If so, the value MUST be guaranteed unique, and it MUST be present in the post as provided to the archiver. The contents of this header need not be distinct from the contents of the Message-ID header, as long as the uniqueness guarantee is maintained.

0.2 Archives

A conforming archive MUST reserve the namespaces "message-id/" and "list-post-id/" relative to its base URL for the uses described below.

A conforming archive MUST support retrieval by Message-ID, using the namespace "message-id/$(MESSAGE-ID)" relative to its base URL.

The archive specified in the List-Archive header field MUST support access using the value of that field as its base URL.

A conforming archive SHOULD support retrieval by UUID, using the namespace "list-post-id/$(LIST-ARCHIVE-UUID)" relative to its base URL.

If the scheme is "http" or "https", a conforming archive that does not support retrieval by UUID SHOULD return status 501 NOT IMPLEMENTED with an entity explaining that retrieval by UUID is not implemented.

A conforming archive MAY support "friendlyurls" for use where space is constrained (eg, in a post's footer).

A conforming archive may support any other URIs it wants to, too.

A third party SHOULD be able to regenerate a friendlyurl from the original message contents.

0.3 Software

Conforming archive software SHOULD provide interfaces for generating UUIDs and friendlyurls, if retrieval is supported. Conforming list managers SHOULD use these interfaces.

Some comments:

The interfaces for generated URLs should be provided as command line utilities as well as callable functions.

Although the conformance level for friendlyurl support is "may", I expect that essentially all archives will support friendlyurls.

The namespace for UUIDs and friendlyurls should probably be more restricted than "any valid URI".
"List manager" denotes any source of archival content (eg, you could imagine a user storing their outbox in a archive, so that the "list manager" would actually be the user's MUA). The namespaces suggested above are good enough, I think, but there may be better ones. Instead of 501 NOT IMPLEMENTED, I considered 410 GONE, but that implies a request to delete the reference. Since this is implemented as a header in the post, the archive could be augmented to support it later. In the phrase "guaranteed unique", "guaranteed" means "to the level provided by uuidgen or standard Message-ID generators". Generation of friendlyurls or unique ids based on message body content is probably a bad idea. From stephen at xemacs.org Wed Jul 25 18:56:45 2007 From: stephen at xemacs.org (Stephen J. Turnbull) Date: Thu, 26 Jul 2007 01:56:45 +0900 Subject: [Mailman-Developers] Improving the archives In-Reply-To: References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com> <849198AE-DEC3-44C8-A090-470720624185@python.org> <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp> <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org> <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp> <877ioqys3k.fsf@athene.jamux.com> <57B0148D-D5E6-4EA5-8C93-493BC06FBA86@zone12.com> Message-ID: <873azcl8vm.fsf@uwakimon.sk.tsukuba.ac.jp> Barry Warsaw writes: > Yes, definitely. What do you think of the base32 examples I have on > the wiki page? They're somewhat better than Message-IDs for readability, but they're not user-friendly. > On Jul 24, 2007, at 1:11 PM, Terri Oda wrote: > > > It seems silly to generate nice short links but then use message-id. The use case for the message-id is not people. It's software, which doesn't much care about "nice short". But the developers debugging and maintaining the software will thank us for the ease of verifying that the URL goes to the right place. 
From jeff at jab.org Thu Jul 26 08:23:55 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Wed, 25 Jul 2007 23:23:55 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
    <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
    <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
    <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
    <877ioqys3k.fsf@athene.jamux.com>
    <2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org>
Message-ID:

> If you improve the script or find numbers that lead to different
> conclusions, now's the time to know!

Live and learn!

So I just looked at 2 million raw messages from 2007, spread over a few thousand mailing lists (all data is from mail-archive.com). My first question was - when comparing only with messages from the same list - how many times do I see a repeated message-id? The answer was ... drumroll please ... 260 thousand. What the hell?

Time for a closer look. In some cases, the archiver was getting two copies of every message. For example, the MLM (mailman) was sending out a message to subscriber A and subscriber B, and both paths eventually led to the archiver. In another case, the MLM (YahooGroups) spammed 20 copies of the same message to every subscriber, and modified the body of each one. YahooGroups tends to create HTML mail and stick ads, possibly spyware, and who knows what other crap in message footers. There are probably other categories I haven't noticed yet; 260k messages is a lot of checking.

So you'd think the archives would be a complete mess. But they aren't, and I had no idea anything was remotely amiss under the hood. That's because mhonarc only archives one message per message-id. So those 19 repeats from YahooGroups get thrown away.
This is actually a pretty robust strategy when you think about it; it keeps lots of annoyances out of archives and everyone who gets smited deserves it: accidental duplicates, malicious duplicates, broken mail transfer agents. Reasonable people can disagree, but I like it.

So I'm amending my request. If mailman and pipermail++ want to keep a verbatim record of everything passing through the MLM, fine. But please make it also possible to interoperate with archivers that use the looser mhonarc strategy, e.g. allow the interoperability URL to collide when message-ids collide. Currently Stephen's proposal allows this, Barry's does not.

Just to make things really concrete, here's an example from that YahooGroups collision I was describing. The 20 messages spammed to subscribers would all have an interoperability URL something like this (but perhaps not quite so enormously long) embedded in the message, in both headers and possibly a footer.

http://www.mail-archive.com/search?l=estika%40yahoogroups.com&q=3578.125.161.129.196.1175036508.CBNWebMail%40webmail1.cbn.net.id

Clicking on it, the user goes to the archive server. For this particular archiver, an HTTP 302 redirect takes the user to another URL which happens to be more human friendly. But the details of what alternate URLs are available - if any - are really up to the archive server.

http://www.mail-archive.com/estika@yahoogroups.com/msg01341.html

I think that's about it. I do kind of like Stephen's suggestion of allowing the archiver to supply a formula for the interoperability URL; if that's the case I'd say the RFC2369 headers could be fair game for use in the calculation. That allows cross posted messages to easily link to their correct archive - note how I used the contents of List-Post when creating the interoperability URL above.
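
[Editorial sketch] The interoperability URL shown above can be reproduced from the list posting address and the Message-ID with ordinary query-string encoding. The function name `interop_url` and its parameter names are made up for illustration; the `l` and `q` parameters and the base URL are taken from the example URL in the message, and nothing here is claimed to be how mail-archive.com generates its links.

```python
from urllib.parse import urlencode


def interop_url(list_post, message_id, base="http://www.mail-archive.com/search"):
    """Build a search-style interoperability URL.

    Hypothetical helper. list_post is the address from the List-Post header
    without the "mailto:" prefix; message_id is given without angle brackets.
    """
    return base + "?" + urlencode({"l": list_post, "q": message_id})


url = interop_url(
    "estika@yahoogroups.com",
    "3578.125.161.129.196.1175036508.CBNWebMail@webmail1.cbn.net.id",
)
print(url)
```

Since only the list address and the Message-ID go into the URL, every copy of a message with the same Message-ID on the same list maps to the same link, which is the looser mhonarc-style behaviour requested above.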
Jeff

From Dale at Newfield.org Thu Jul 26 09:37:37 2007
From: Dale at Newfield.org (Dale Newfield)
Date: Thu, 26 Jul 2007 03:37:37 -0400
Subject: [Mailman-Developers] Improving the archives
In-Reply-To:
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
    <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
    <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
    <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
    <877ioqys3k.fsf@athene.jamux.com>
    <2009EA3C-9E11-4B2A-BF57-A62C0EB11870@python.org>
Message-ID: <46A84F41.4020003@Newfield.org>

Jeff Breidenbach wrote:
> So I just looked at 2 million raw messages from 2007, spread over
> a few thousand mailing lists (all data is from mail-archive.com). My
> first question was - when comparing only with messages from the
> same list - how many times do I see a repeated message-id? The
> answer was ... drumroll please ... 260 thousand. What the hell?

I think the question you were originally going to ask got sidetracked.

If we assume that all these "multiple paths from list to archive" duplicates not only share a Message-ID but also a Date (they were the same message originally, so they should!), then both schemes (messageid, and messageid+date) would decide that all (but one of) these messages are redundant.

What we really want to know is how many (non-empty) Message-ID collisions are there that *don't* share a Date? This is the number of messages that only-messageid loses, and that the composite identifier method would not lose.
-Dale

From jeff at jab.org Fri Jul 27 06:56:44 2007
From: jeff at jab.org (Jeff Breidenbach)
Date: Thu, 26 Jul 2007 21:56:44 -0700
Subject: [Mailman-Developers] Improving the archives
In-Reply-To: <200707270320.l6R3KXCJ028654@gator.earlhood.com>
References: <8BA8AA8B-2575-4794-AEB5-CF4CFAE99CE6@zone12.com>
    <849198AE-DEC3-44C8-A090-470720624185@python.org>
    <87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>
    <7FA2A94F-9C4A-48CA-A25E-677F18BAB17A@python.org>
    <874pjumgrg.fsf@uwakimon.sk.tsukuba.ac.jp>
    <419AFBBF-82FF-4939-B85B-85055A1B8482@python.org>
    <200707270320.l6R3KXCJ028654@gator.earlhood.com>
Message-ID:

> If you are relying on the sender to do the right thing, then
> why not force them to create proper message-ids?

I think Barry's proposal is essentially a numbers game - i.e. he's hoping for significantly better results using "Date" in the calculation than not using it.

http://wiki.list.org/display/DEV/Stable+URLs

I'll try to tease out some more useful stats from some large datasets this weekend. (I can't just run the python scripts as is because I don't have python 2.5 in the same place as the data, I don't keep raw messages in mbox format, blah blah blah, but we'll figure it out.) My hypothesis is that "Date" doesn't really buy much, but that's in part because I have a vested interest in that outcome. We'll see how the data plays out.

And I still think RFC2369 headers are needed in the calculation if cross posted messages are to be handled correctly.

Jeff
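
[Editorial sketch] The statistic asked for earlier in the thread (how many repeated Message-IDs do *not* also share a Date) can be computed without mbox files, from any iterable of parsed messages. This is not the scan.py script referenced above; `collision_stats` and the keys of the dictionary it returns are names invented for this illustration, and the sample messages are made up.

```python
from collections import defaultdict
from email import message_from_string


def collision_stats(messages):
    """Count repeated Message-IDs, split by whether the Date repeats too."""
    dates_by_id = defaultdict(list)
    for msg in messages:
        mid = (msg.get("Message-ID") or "").strip()
        if not mid:
            continue  # Skip messages with no id; the MLM can create one.
        dates_by_id[mid].append((msg.get("Date") or "").strip())

    # Messages a Message-ID-only scheme would treat as duplicates.
    repeated = sum(len(dates) - 1 for dates in dates_by_id.values())
    # Of those, messages that a Message-ID+Date scheme would also drop.
    same_date = sum(len(dates) - len(set(dates)) for dates in dates_by_id.values())
    return {
        "repeated_ids": repeated,
        "same_id_and_date": same_date,
        "same_id_different_date": repeated - same_date,
    }


def make(mid, date):
    """Build a tiny made-up message with the given headers."""
    headers = ("Message-ID: %s\n" % mid if mid else "") + "Date: %s\n" % date
    return message_from_string(headers + "\nbody\n")


sample = [
    make("<a@example.invalid>", "Thu, 26 Jul 2007 10:00:00 -0400"),
    make("<a@example.invalid>", "Thu, 26 Jul 2007 10:00:00 -0400"),
    make("<a@example.invalid>", "Thu, 26 Jul 2007 11:30:00 -0400"),
    make("<b@example.invalid>", "Thu, 26 Jul 2007 12:00:00 -0400"),
    make("", "Thu, 26 Jul 2007 13:00:00 -0400"),
]
stats = collision_stats(sample)
print(stats)
```

For data kept in mbox form, iterating over `mailbox.mbox(path)` from the standard library yields message objects that could be passed to the same function; for other storage, any loop that parses one raw message at a time would do. Dates are compared as raw header strings here, so two spellings of the same instant count as different.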