[Mailman-Developers] Improving the archives

Jeff Breidenbach jeff at jab.org
Thu Aug 2 04:17:44 CEST 2007


> What we really want to know is how many (non-empty) Message-ID
> collisions are there that *don't* share a Date?  This is the number of
> messages that only-messageid loses, and that the composite identifier
> method would not lose.

It took longer than expected, but I now have numbers from
looking at 2,151,896 messages spread over a few thousand
lists. The appended script was run over a set of MH format
raw messages.

704 messages fall into this category. Of these, 596 come from a
single (malfunctioning and duplicate-spewing) list server. I have
not yet examined the remaining 208 messages, but I'll bet anything
many also have duplicate message bodies, or are spam. So for this
data set, we have an upper bound of 0.01% of messages in this
category, possibly significantly less.
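
As a quick check of the arithmetic, assuming the 0.01% figure refers to
the 208 unexamined messages out of the 2,151,896 examined, the share can
be computed as below. The helper name `percent` is made up for this
illustration and is not part of the appended script.

```shell
#!/bin/bash
# Hypothetical helper: print n as a percentage of t, to four decimals.
percent() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.4f%%\n", 100 * n / t }'
}

# 208 unexamined collisions out of 2,151,896 messages
percent 208 2151896    # prints 0.0097%
```

Rounded to two decimals, that is the 0.01% quoted above.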

Jeff


#!/bin/bash
#
# Look for messages that
#
# Do collide with message-id
# Don't collide with message-id + date

DIR=/home/archive/Mail

C1=0
C2=0

# Print "DATE | MESSAGE-ID" for every message in list $1 whose Message-ID
# is shared with another message but whose Date + Message-ID pair is unique.
get_interesting_messages() {
    cd "$DIR/$1" || return
    for j in $(ls -U); do
        MSG_ID=$(822field message-id < "$j")
        MSG_DATE=$(822field date < "$j")
        if [ "$MSG_ID" != "" ]; then
            echo $MSG_DATE "|" $MSG_ID
        fi
    done |\
        sort |\
        uniq --separator='|' --skip-fields=1 --all-repeated |\
        uniq --uniq
}


for i in $(ls $DIR | grep @); do
    DUP=$(get_interesting_messages $i)
    # grep -c '' also counts the final line, which has no trailing newline
    DUP_CNT=$(printf '%s' "$DUP" | grep -c '')
    MSG_CNT=$(cd $DIR/$i && ls -U | wc -w)
    C1=$(( C1 + MSG_CNT ))
    C2=$(( C2 + DUP_CNT ))
    if [ $DUP_CNT != 0 ]; then
        echo
        echo "=== collisions/messages: $C2/$C1 $i"
        echo "$DUP"
    else
        echo -n . 1>&2
    fi
done
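
For a uniq that has no option for choosing the field separator, the same
filter can be sketched with awk. This is only an illustration and not the
code that produced the numbers above. The function name
`collisions_without_date` is made up, and the sketch assumes input lines
of the form `DATE | MESSAGE-ID`, as printed by the loop in the script,
with no ` | ` inside the Date itself.

```shell
#!/bin/bash
# Hypothetical helper: reads "DATE | MESSAGE-ID" lines on stdin and prints
# the lines whose Message-ID occurs more than once but whose complete
# Date + Message-ID pair occurs exactly once.
collisions_without_date() {
    awk '
        {
            # Everything after the first " | " is taken as the Message-ID.
            sep = index($0, " | ")
            if (sep == 0) next
            id = substr($0, sep + 3)
            line[NR] = $0
            key[NR] = id
            id_count[id]++
            pair_count[$0]++
        }
        END {
            # Keep input order; unlike uniq, no sorting is needed.
            for (n = 1; n <= NR; n++)
                if ((n in line) && id_count[key[n]] > 1 && pair_count[line[n]] == 1)
                    print line[n]
        }'
}

# Three messages share <a@example.org>; two of them also share a Date.
printf '%s\n' \
    'Mon, 1 Jan 2007 10:00:00 +0000 | <a@example.org>' \
    'Mon, 1 Jan 2007 10:00:00 +0000 | <a@example.org>' \
    'Tue, 2 Jan 2007 11:00:00 +0000 | <a@example.org>' \
    'Wed, 3 Jan 2007 12:00:00 +0000 | <b@example.org>' |
    collisions_without_date
# prints only the "Tue, 2 Jan 2007" line
```

Because awk counts occurrences in arrays rather than comparing adjacent
lines, the input does not have to be sorted first.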

>
> -Dale