[Mailman-Users] Filtering out duplicate emails

Sat Feb 27 17:06:57 CET 2010

Chiang Wu wrote:

>So I'm trying to implement a custom handler in Mailman and I'm a bit stuck.
>I wrote up some Python code in order to try to compare an email to a file to
>try to not send out the e-mail if the email's message matches something in
>the text file. I can't seem to find any examples of custom handlers, so this
>is what I came up with so far. Any help would be appreciated.

You apparently found ToOutgoing.py. That and all the other handlers in
Mailman/Handlers/ are examples of handlers. The only difference
between these and 'custom' handlers is the fact that these are in the
default pipeline.
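
As a rough illustration of that difference, a custom handler gets into the pipeline by name, positioned relative to the default handlers. The sketch below uses a shortened stand-in list rather than the real GLOBAL_PIPELINE, and 'DropDuplicates' and add_before are hypothetical names invented for the example.

```python
# Stand-in for Mailman's GLOBAL_PIPELINE (shortened for illustration;
# the real default list is longer).
GLOBAL_PIPELINE = ['SpamDetect', 'Approve', 'CookHeaders',
                   'ToDigest', 'ToArchive', 'ToOutgoing']


def add_before(pipeline, new_handler, existing='ToOutgoing'):
    """Insert new_handler immediately before existing in the pipeline."""
    pipeline.insert(pipeline.index(existing), new_handler)
    return pipeline


# 'DropDuplicates' is a hypothetical custom handler module name.
add_before(GLOBAL_PIPELINE, 'DropDuplicates')
```

In a real installation the same kind of insert would presumably be applied in mm_cfg.py to the pipeline inherited from the defaults; the FAQ has the exact procedure.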

Have you read the FAQ at <http://wiki.list.org/x/l4A9>?

>#pythonhandlerv2.py
>#Version 2
>
>#Receive E-mail.
>#Compare e-mail to filter file.
>#Compare Content vs a file. File will have messages that should be examined
>#against the current email.
>#If matches, remove ToOutgoing from the pipeline in the message's metadata.
>
>import sys
>
>from Mailman import mm_cfg
>from Mailman import Errors
>from Mailman.Logging.Syslog import syslog
>from Mailman.Queue.sbcache import get_switchboard

The module-level imports above are fine, but note that all module-level
code is executed when the module is imported, before the module's
process() function is ever called.

># gets all the lines of the message
>msgtext = str(msg)

That means the line above raises a NameError as soon as the module is
imported, because msg is undefined at the module level. All of this
needs to be inside the process() function definition.
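
A minimal sketch of that point, written to run without Mailman: the module level holds only definitions, and msg is touched only inside process(). The file name HandlerSkeleton.py and the 'msgtext' key stored in msgdata are illustrative assumptions, not Mailman conventions.

```python
# HandlerSkeleton.py (hypothetical name). Nothing at module level refers
# to msg, so importing this module cannot fail on an undefined name.

def process(mlist, msg, msgdata):
    # msg exists only here, when process() is called for a message.
    msgtext = str(msg)
    # Stash the text so the sketch has a visible effect (assumption: a
    # real handler would instead compare it with the filter data).
    msgdata['msgtext'] = msgtext
```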

>#test to see if log file catches the message
>
>FILE = open('log.txt', 'w')
>FILE.writelines(msgtxt)
>

The stuff below could also be in the process() definition or in a
separate function called from within process(). At the module level,
as here, it is executed only once, the first time IncomingRunner
imports the handler, and never again.

>
>message = []
>n=0
>inputline = ''
>fullline = ''
>infile = file('test.txt', 'r')
>
>inputline = infile.readline()
>
>
>while True:
>
>   if len(inputline) == 0: #EOF if length = 0. Breaks out of the while loop
>      break
>
>   else:   #read in between delimiter and other delimiter (~~) and place
>           #into message
>
>      if inputline == '~~\n':
>         message.append(fullline)
>         fullline = ''
>         n+=1
>         inputline = infile.readline()
>
>      else:
>         fullline += inputline
>         inputline = infile.readline()
>
>n-=1
>
>#print repr(strvalue2)
>
>
>while n>=0:
>
>   message[n]=message[n][:-1] #Removes the last new line character (\n) in
>   #the message
>   #print repr(message[n])
>   #compare message[] to the email text. if matches, break and exit program
>   if message[n] in msgtext:
>      sys.exit(0)
>
>
>   else: #else keep looping until end of loop and place outgoing.py after
>         #the loop
>      n-=1

See below...

>#ToOutgoing.py portion of the code
>
>def process(mlist, msg, msgdata):
>    interval = mm_cfg.VERP_DELIVERY_INTERVAL
>    # Should we VERP this message?  If personalization is enabled for this
>    # list and VERP_PERSONALIZED_DELIVERIES is true, then yes we VERP it.
>    # Also, if personalization is /not/ enabled, but VERP_DELIVERY_INTERVAL is
>    # set (and we've hit this interval), then again, this message should be
>    # VERPed. Otherwise, no.
>    #
>    # Note that the verp flag may already be set, e.g. by mailpasswds using
>    # VERP_PASSWORD_REMINDERS.  Preserve any existing verp flag.
>    if msgdata.has_key('verp'):
>        pass
>    elif mlist.personalize:
>        if mm_cfg.VERP_PERSONALIZED_DELIVERIES:
>            msgdata['verp'] = 1
>    elif interval == 0:
>        # Never VERP
>        pass
>    elif interval == 1:
>        # VERP every time
>        msgdata['verp'] = 1
>    else:
>        # VERP every `inteval' number of times
>        msgdata['verp'] = not int(mlist.post_id) % interval
>    # And now drop the message in qfiles/out
>    outq = get_switchboard(mm_cfg.OUTQUEUE_DIR)
>    outq.enqueue(msg, msgdata, listname=mlist.internal_name())

Essentially, your handler duplicates ToOutgoing: the only part of your
code that would be called for each message is the process() function,
which does exactly what ToOutgoing's process() does. If your intent is
to replace ToOutgoing.py with this handler, you would put your code
inside the definition of the process() function and just return
without queueing the message if it is a duplicate.

But, then this would not be a custom handler. It would just be a
patched version of ToOutgoing.py, and you would lose all the
advantages of a custom handler.

What I meant by "remove 'ToOutgoing' from the pipeline in this message's
metadata" was something like:

def process(mlist, msg, msgdata):

    (code to determine if this msg is a duplicate)

    if duplicate:
        try:
            msgdata['pipeline'].remove('ToOutgoing')
        except ValueError:
            # ToOutgoing wasn't there
            pass

Then, as long as your handler precedes ToOutgoing in the pipeline, if
its duplicate-detection code determines this msg is a duplicate and
sets duplicate to True, it will remove ToOutgoing from the list of
handlers remaining to be processed. The message will continue to be
processed normally, including archiving and appearing in the digests,
but it won't be queued for delivery by ToOutgoing.
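
Filling in the placeholder, a sketch along these lines might serve as a starting point. It runs standalone with plain Python objects; KNOWN_TEXTS and is_duplicate are hypothetical names, and the substring test is carried over from the quoted code rather than being a recommended duplicate check.

```python
# Hypothetical filter data. A real handler would load this inside
# process() or a helper rather than hard-coding it.
KNOWN_TEXTS = ['This exact announcement was already sent.']


def is_duplicate(msgtext, known_texts):
    """Return True if any non-empty stored text occurs in msgtext."""
    for text in known_texts:
        if text and text in msgtext:
            return True
    return False


def process(mlist, msg, msgdata):
    if is_duplicate(str(msg), KNOWN_TEXTS):
        try:
            # Take ToOutgoing off the list of handlers still to run.
            msgdata['pipeline'].remove('ToOutgoing')
        except (KeyError, ValueError):
            # KeyError: no pipeline in msgdata (defensive assumption).
            # ValueError: ToOutgoing wasn't there.
            pass
```

Because the handler only edits the list and returns normally, the handlers remaining in the pipeline still run for the message.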

>On Thu, Feb 11, 2010 at 7:50 AM, Mark Sapiro <mark at msapiro.net> wrote:
>
>> Chiang Wu wrote:
>>
>> >Hi. I was wondering if anyone knows a way to filter out, yet still archive
>> >new e-mails that have the same content as an already archived e-mail.
>>
>>
>> You would have to write a custom handler[1] to examine the content of
>> this mail and compare it to something and if it is a 'duplicate',
>> remove 'ToOutgoing' from the pipeline in this message's metadata.
>>
>> Comparing it to the archive is problematic because archiving is
>> asynchronous with incoming message processing, and if two 'duplicates'
>> arrive close in time, the first may not be archived when you process
>> the second.
>>
>> If you want to avoid truly identical content, this handler could keep a
>> small database of some hash of the content and its process time for
>> lookup as subsequent messages arrive. If the 'duplicate' content
>> differs in things like time stamps only, you could filter those before
>> hashing.
>>
>>
>>
>> [1] <http://wiki.list.org/x/l4A9>
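
The hash idea in the quoted message could be sketched roughly as follows. This is only an illustration under stated assumptions: the names content_hash and seen_before, the HH:MM:SS time stamp pattern, the one-day max_age, and the in-memory dict standing in for the "small database" are all mine, and it is written for Python 3 (under Mailman 2.1's Python 2 the encode step would need adjusting).

```python
import hashlib
import re
import time

# Assumed normalization: blank out HH:MM:SS time stamps before hashing.
TIMESTAMP_RE = re.compile(r'\d{2}:\d{2}:\d{2}')


def content_hash(body):
    """Hash the body with time stamps filtered out."""
    normalized = TIMESTAMP_RE.sub('', body)
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()


def seen_before(body, seen, now=None, max_age=86400):
    """Record body's hash in seen; return True if it was already there.

    seen maps hash -> time last processed. Entries older than max_age
    seconds are dropped first so the table stays small.
    """
    if now is None:
        now = time.time()
    for key in [k for k, t in seen.items() if now - t > max_age]:
        del seen[key]
    digest = content_hash(body)
    duplicate = digest in seen
    seen[digest] = now
    return duplicate
```

A dict like this is lost whenever the runner restarts, so a real handler would have to persist the table somewhere, and would call seen_before() from within process().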

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan