[Mailman-Developers] Improving the archives

Wed Jul 25 06:47:23 CEST 2007

> What you gain from my proposal over a pure Message-ID approach
> is guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two
messages with the same message-id, same date, but different bodies.
Sometimes the channel between the MLM and the archive server will
be SMTP, and spurious messages can be injected. Finally, from the archive
server's perspective, some of the MLMs might make mistakes, just as,
from the MLM's perspective, some of the MTAs might make mistakes in
setting message-id. So I don't think the proposed SHA1(date, message-id)
scheme buys a hard guarantee of uniqueness. Every component has
to protect itself, but none can solve the world's problems.
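
To make the first objection concrete, here is a minimal sketch. It
assumes the key is a hex SHA-1 over the Date and Message-ID headers
joined by a newline; the exact encoding in the proposal may differ, and
archive_key is a name I made up for illustration.

```python
import hashlib


def archive_key(date_header: str, message_id: str) -> str:
    """Hypothetical key: hex SHA-1 over the Date and Message-ID headers.

    The separator and encoding are assumptions, not the proposal's spec.
    """
    material = (date_header.strip() + "\n" + message_id.strip()).encode("utf-8")
    return hashlib.sha1(material).hexdigest()


# Two messages that share Date and Message-ID but differ in body.
genuine = {
    "Date": "Wed, 25 Jul 2007 06:47:23 +0200",
    "Message-ID": "<abc123@example.org>",
    "body": "The original text.",
}
forged = {
    "Date": "Wed, 25 Jul 2007 06:47:23 +0200",
    "Message-ID": "<abc123@example.org>",
    "body": "Something else entirely.",
}

key_genuine = archive_key(genuine["Date"], genuine["Message-ID"])
key_forged = archive_key(forged["Date"], forged["Message-ID"])

# The body never enters the hash, so the keys are identical.
keys_collide = key_genuine == key_forged and genuine["body"] != forged["body"]
print(key_genuine, keys_collide)
```

The body is not an input, so whoever controls both headers controls
the key.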

So that moves us to how much the scheme reduces collisions in practice.
I have a question about the numbers Barry mined from the python
lists. Are the collisions really that high? One should not count
messages without a message-id, because the MLM can and should
create one in that case.

One should also not count collisions of messages going to different
lists. Here's why. Let's say message M is cross posted to lists L1 and
L2. Even though it is the same message, there are now two different
contexts. (For example, people visiting M at archive L1 should get a
completely different experience when they hit "next message" than people
visiting M at archive L2.)

So I'd be curious what the collision numbers come to with these two
factors taken into account. The other takeaway is that the list name really
should be part of the URL to get proper context. The earlier example
from Mharc does this.
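
Here is roughly the count I have in mind, as a sketch only. It assumes
the mined data can be reduced to (list name, message-id) pairs, with
None standing in for a missing message-id; count_collisions and the
sample data are invented for illustration and are not Barry's numbers.

```python
from collections import Counter


def count_collisions(messages):
    """Count surplus messages that share a Message-ID within one list.

    messages: iterable of (list_name, message_id) pairs; message_id may
    be None or empty when the header was missing.
    """
    seen = Counter()
    for list_name, message_id in messages:
        if not message_id:
            # The MLM can and should create an id here, so don't count it.
            continue
        # Key on the list as well, so cross-posts stay separate.
        seen[(list_name, message_id.strip())] += 1
    return sum(n - 1 for n in seen.values() if n > 1)


sample = [
    ("L1", "<m@example.org>"),
    ("L2", "<m@example.org>"),    # cross-post to another list
    ("L1", None),                 # missing message-id
    ("L1", None),                 # missing message-id
    ("L1", "<dup@example.org>"),
    ("L1", "<dup@example.org>"),  # same id twice on the same list
]

collisions = count_collisions(sample)
print(collisions)  # 1
```

Only the last pair counts; the cross-post and the two id-less messages
do not.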

> and human friendlier urls.

That's a very compelling point.

SHA1 can't be computed inside someone's head or simply cut-n-pasted
together for old messages, but I think the usability benefits of short
URLs (short enough that they can comfortably fit inside message bodies)
outweigh this drawback. By the way, is SHA-1 still in favor? My
impression was it was fading away after the Shandong University team
partially cracked it.

> Throw it away or hide [Date]?  The former would be a problem,
> but not the latter.

Thrown away. My favorite archival service is based on mhonarc,
and raw mail goes into offline cold storage. Of course this can be
changed for future messages with some pain, but there's no
reasonable way for me (or any other mhonarc users in the
same predicament) to retrofit Date-based URLs. For the
record, here's what mhonarc embeds in each HTML page it
produces because these were considered the important headers.
In this message, sent from Australia, the date shows a timezone
of UTC -0700, because it was pulled from the Received header.

<!-- MHonArc v2.6.15 -->
<!--X-Subject: [Gossip] Re: green&#45;travel resources {webliographies} -->
<!--X-From-R13: "[nephf Z. Saqvpbgg" <zraqvpbgNlnubb.pbz> -->
<!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->
<!--X-Message-Id: 20060426072529.45761.qmail at web54507.mail.yahoo.com -->
<!--X-Content-Type: text/plain -->
<!--X-Reference: e03b90ae0604242000q70a81fcete7da4965c581c838 at mail.gmail.com -->
<!--X-Head-End-->
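
For anyone who wants to see what can be recovered from such pages, a
rough sketch of pulling the fields back out follows. The regular
expression is my own guess at the comment layout based on the sample
above, not something taken from the mhonarc documentation, and
parse_mhonarc_headers is a made-up name.

```python
import html
import re
from email.utils import parsedate_to_datetime

# Assumed layout, inferred from the sample: <!--X-Name: value -->
HEADER_RE = re.compile(r"<!--X-([A-Za-z0-9-]+): (.*?) -->")


def parse_mhonarc_headers(page: str) -> dict:
    """Collect X- comment fields, decoding entities such as &#45;."""
    fields = {}
    for name, value in HEADER_RE.findall(page):
        fields.setdefault(name, []).append(html.unescape(value))
    return fields


page = """<!-- MHonArc v2.6.15 -->
<!--X-Subject: [Gossip] Re: green&#45;travel resources {webliographies} -->
<!--X-Date: Wed, 26 Apr 2006 00:27:27 &#45;0700 -->
<!--X-Content-Type: text/plain -->
<!--X-Head-End-->
"""

fields = parse_mhonarc_headers(page)
stored_date = parsedate_to_datetime(fields["Date"][0])
print(fields["Subject"][0])
print(stored_date.isoformat())
```

Note that the recovered value is whatever date mhonarc stored, which in
this case is the Received timestamp rather than the sender's Date.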

So my main request is to double check the numbers, see if using
"Date" really buys as much as one thinks. I'll keep digesting the
other aspects of the wiki page.