[Mailman-Developers] Improving the archives

Barry Warsaw barry at python.org
Fri Jul 20 15:49:49 CEST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote:

>> How likely is it that two messages with the same message-id and
>> date are /not/ duplicates?
>
> For message id generators that include a time-stamp in the generated
> id, approximately the same as the probability that two messages with
> the same message-id are not duplicates, no?

Good point, though clearly not all message-ids have timestamp  
information in them.  It does help explain why I see 600-odd more  
collisions when taking other data into account too.  I've modified my  
script to sort collisions and dupes into maildir folders, so I'll  
take a closer look when that finishes running (it takes a long time  
to slog through all 5 mboxes, even on a fairly zippy dual-G5).
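
For anyone who wants to try the same kind of pass on their own
archives, here is a minimal sketch. It is not the actual script. It
assumes a "dupe" is a message whose Message-ID, Date and body all match
an earlier message, and a "collision" is one that shares a Message-ID
but differs otherwise. The names classify() and fingerprint() are made
up for illustration.

```python
import hashlib


def fingerprint(msg):
    """Return (Date, body hash) for a message; a hypothetical helper."""
    if msg.is_multipart():
        body = ''.join(part.as_string() for part in msg.get_payload())
    else:
        body = msg.get_payload()
    digest = hashlib.sha1(body.encode('utf-8', 'replace')).hexdigest()
    return ((msg.get('Date') or '').strip(), digest)


def classify(messages):
    """Split messages into (unique, dupes, collisions) lists.

    Assumption: messages with no Message-ID are always treated as unique.
    """
    seen = {}  # Message-ID -> set of fingerprints already filed
    unique, dupes, collisions = [], [], []
    for msg in messages:
        msgid = (msg.get('Message-ID') or '').strip()
        fp = fingerprint(msg)
        if not msgid:
            unique.append(msg)
        elif msgid not in seen:
            seen[msgid] = {fp}
            unique.append(msg)
        elif fp in seen[msgid]:
            # Same id, same date, same body: a true duplicate.
            dupes.append(msg)
        else:
            # Same id but something else differs: a collision.
            seen[msgid].add(fp)
            collisions.append(msg)
    return unique, dupes, collisions
```

The standard library's mailbox.mbox objects can be iterated to feed
classify(), and mailbox.Maildir has an add() method for filing each
result list into its own folder.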

>> Heck, at that point, I'd feel justified in simply automatically
>> rejecting the duplicate and chucking it from the archive.
>
> I'd rather not go there.  There may be applications for the archiver
> that require that all mail received be filed.

True.  It would ultimately be an archiver policy though.

> Counterproposal: have a "collisions" namespace, and provide an
> interface for the list owner to decide what to do with them.  They
> could be thrown away, they could be given an alternative global ID
> somehow and added (eg, the archive page could add a "See probable
> duplicates too" link), or they could be put into a moderation-like
> queue for list admins to decide about.

I like this.

>> So now, think of the interface to a message store that supports this
>> addressing scheme.  Well it's something like:
>
> I don't understand how the calling application is supposed to deal
> with a DuplicateMessageError exception since it should not change
> either the Message-ID or the Date if present.
>
> I see this as a major problem with any proposal to use only author
> headers in computing the "global id".

Mailman would probably log and ignore DuplicateMessageErrors.  It  
wouldn't be Mailman's responsibility to ensure the message gets  
archived, although I concede that as currently defined, you could end  
up with list copies that had a global id header that wasn't unique.   
OTOH, if the archiver implements a collision resolution policy such  
as a 'collisions' namespace, it wouldn't ever raise  
DuplicateMessageError.
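
To make the two behaviours concrete, here is a rough sketch of a store
with an optional 'collisions' namespace. Everything except the name
DuplicateMessageError is hypothetical: the MessageStore class, its add()
method, the keep_collisions flag, and the choice of a SHA-1 over
Message-ID plus Date as the global id are all assumptions made for
illustration, not the interface proposed earlier in this thread.

```python
import hashlib


class DuplicateMessageError(Exception):
    """Raised when a global id is already stored and no policy absorbs it."""


def global_id(message_id, date):
    # Assumption: the id is derived only from author-supplied headers.
    data = (message_id + date).encode('utf-8')
    return hashlib.sha1(data).hexdigest()


class MessageStore:
    """Hypothetical in-memory store keyed by global id."""

    def __init__(self, keep_collisions=False):
        self.messages = {}     # global id -> first message text seen
        self.collisions = {}   # global id -> later texts with the same id
        self.keep_collisions = keep_collisions

    def add(self, message_id, date, text):
        gid = global_id(message_id, date)
        if gid not in self.messages:
            self.messages[gid] = text
            return gid
        if self.keep_collisions:
            # Policy: park the message for the list owner to review.
            self.collisions.setdefault(gid, []).append(text)
            return gid
        raise DuplicateMessageError(gid)
```

With keep_collisions=False the caller has to catch (and, in Mailman's
case, probably just log) the exception. With keep_collisions=True the
store never raises and the second message lands in the collisions
namespace instead.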

>> Or by using the global id, or by rejecting messages with duplicate
>> message ids.
>
> Er, the MTA has already accepted it.  Do you plan to generate a list
> manager bounce to the poster?  This has the unpleasant misfeature that
> it could be used to bounce spam off the list manager, since the poster
> needs to see content to determine whether this is a multiple send or
> actually the "intended version" after a "fat-finger" send; we already
> know the message-id isn't good enough.

Yes, this wouldn't be an MTA bounce; it would be a Mailman bounce.
But it would have to be subject to the same rules as any other
auto-response that could be used as a spam vector, e.g. limit the
number of bounces per time period and don't include the entire
original message in the bounce (both can be, and are, used as spam
vectors).
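
A minimal sketch of those two rules, assuming a simple sliding-window
limit per sender address and a bounce body cut to the first few lines of
the original. BounceLimiter, bounce_body() and the default numbers are
all hypothetical and do not describe existing Mailman code.

```python
import time
from collections import defaultdict, deque


class BounceLimiter:
    """Allow at most max_bounces per address within period seconds."""

    def __init__(self, max_bounces=3, period=3600.0, clock=time.time):
        self.max_bounces = max_bounces
        self.period = period
        self.clock = clock               # injectable for testing
        self.sent = defaultdict(deque)   # address -> times of recent bounces

    def allow(self, address):
        now = self.clock()
        sent = self.sent[address]
        # Forget bounces that have aged out of the window.
        while sent and now - sent[0] >= self.period:
            sent.popleft()
        if len(sent) >= self.max_bounces:
            return False
        sent.append(now)
        return True


def bounce_body(original_text, max_lines=10):
    """Quote only the first max_lines lines of the original message."""
    return '\n'.join(original_text.splitlines()[:max_lines])
```

A caller would check allow(sender) before generating the bounce and
silently drop the auto-response when it returns False.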

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqC9fnEjvBPtnXfVAQLkEQQAhdu0BIvpRvTk92m9J/sbHVRSRxBGMqta
Cm57WyRJGBxPV3xTE4ghVzXdDyIEvUjKimRTEWbeX60WqROL6FPsmAnwmsYbW3mw
8hqNXj+SpHP+1GIYnYgY9txiM75fHDa5T0VsjpcXAwtjeepHouXAEWbegBUrIzHt
EBp5YCMqxv8=
=5tjc
-----END PGP SIGNATURE-----

