[Mailman-Developers] Updated dupe removal patch

Mon, 4 Mar 2002 15:12:40 -0800

On Mon, Mar 04, 2002 at 05:16:21PM -0500, Barry A. Warsaw wrote:
> Okay, I've looked over all the code.  Except for some stylistic
> issues, which I'll just correct as I go, my biggest concern is the
> database used in AvoidDuplicates.py.

Yeah, I looked at that too, but being tired, I didn't get as critical as you
did. I figured  it somehow  worked ok for  Ben for his  lists (bad  Marc, no
cookie)
Now  that  I  think of  it,  Ben  wrote  initially  wrote this  for  qrunner
from  cron,  and  didn't  think  about  this issue  when  he  ported  it  to
mailman-cvs-sept-2001

> It looks like you're keeping an in-memory dictionary mapping
> addresses to a set of Message-ID:'s.  You use this to decide if the
> recipient address has already received a message of the given
> Message-ID:

That was my understanding of the code too.

> Let's ignore the duplicate or missing Message-ID: issue for now.  The
> biggest problem I see is that 1) you lose all the mappings if you
> restart your IncomingRunner, and 

That's probably not a problem because
1) it would only affect a message being processed at the time you kill
   and restart IncomingRunner, not very likely, and worst case, you do
   get a second copy.
2) You don't restart IncomingRunner often if at all
3) When you do restart qrunner, there can be other qwirks, like a message
   being delivered twice (I've seen this with VERP enabled, I probably
   killed it while it was delivering a batch to exim, so it didn't complete
   and did it all over again after the restart)

> 2)  your process  will  grow  without bounds  until  you  do restart  your
> IncomingRunner.

I think you're right. You'd have to have a lot of traffic before it catches
up with you, but it will eventually if you never restart qrunner.

> I'm not sure about the best thing to do.  Sticking this data structure
> in the list, or otherwise making it persistent, could take too much
> resources for not much gain.  The second issue is more important,
> especially given that all our runners are now long running processes,
> and I think most of the unbounded memory growth issues are taken care
> of.  Probably the best thing to do is to evict any entry in the
> dictionary that's older than a day or two.

That sounds like a reasonable plan.

> Then again, this whole data structure seems intended to avoid
> duplicates when lists are crossposted.  It shouldn't be necessary if
> we just want to filter out duplicates to explicitly named recipients.
> Maybe we don't need both features, as the former seems to be much less
> requested than the latter?

That's  true. The later  is nice  for instance  when you  have threads  Cced
accross mailman-devel  and mailman-users,  but having  the former  by itself
would be good already.
If this is a  time issue wrt fixing the code,  the duplicate message-ID code
could be  left behind  a global  option that  is disabled  by default  and a
comment saying  that you should be  ready to restart your  qrunner weekly or
daily if you enable it
That said, adding  a timestamp to them, and deleting  everything that's more
than 1H old is a better solution.

> I think what I'll do for now is code up and test the original
> approach.  I'm on irc now so please join me if you want to talk about
> it.
> 
> now-if-i-can-just-get-OPN-to-stop-kicking-me-off!-ly y'rs,

<-- barry has quit (Read error: 110 (Connection timed out))

:-)

I was saying there:
about OPN, I don't know the details  (and there were also politics), but the
sf.net guys moved their channels to slashnet.org. Seems to work better there

Marc
-- 
Microsoft is to operating systems & security ....
                                      .... what McDonalds is to gourmet cooking

Home page: http://marc.merlins.org/   |   Finger marc_f@merlins.org for PGP key