[Mailman-Developers] Updated dupe removal patch
Marc MERLIN
marc_news@vasoftware.com
Mon, 4 Mar 2002 15:12:40 -0800
On Mon, Mar 04, 2002 at 05:16:21PM -0500, Barry A. Warsaw wrote:
> Okay, I've looked over all the code. Except for some stylistic
> issues, which I'll just correct as I go, my biggest concern is the
> database used in AvoidDuplicates.py.
Yeah, I looked at that too, but being tired, I didn't get as critical as you
did. I figured it somehow worked ok for Ben for his lists (bad Marc, no
cookie)
Now that I think of it, Ben wrote initially wrote this for qrunner
from cron, and didn't think about this issue when he ported it to
mailman-cvs-sept-2001
> It looks like you're keeping an in-memory dictionary mapping
> addresses to a set of Message-ID:'s. You use this to decide if the
> recipient address has already received a message of the given
> Message-ID:
That was my understanding of the code too.
> Let's ignore the duplicate or missing Message-ID: issue for now. The
> biggest problem I see is that 1) you lose all the mappings if you
> restart your IncomingRunner, and
That's probably not a problem because
1) it would only affect a message being processed at the time you kill
and restart IncomingRunner, not very likely, and worst case, you do
get a second copy.
2) You don't restart IncomingRunner often if at all
3) When you do restart qrunner, there can be other qwirks, like a message
being delivered twice (I've seen this with VERP enabled, I probably
killed it while it was delivering a batch to exim, so it didn't complete
and did it all over again after the restart)
> 2) your process will grow without bounds until you do restart your
> IncomingRunner.
I think you're right. You'd have to have a lot of traffic before it catches
up with you, but it will eventually if you never restart qrunner.
> I'm not sure about the best thing to do. Sticking this data structure
> in the list, or otherwise making it persistent, could take too much
> resources for not much gain. The second issue is more important,
> especially given that all our runners are now long running processes,
> and I think most of the unbounded memory growth issues are taken care
> of. Probably the best thing to do is to evict any entry in the
> dictionary that's older than a day or two.
That sounds like a reasonable plan.
> Then again, this whole data structure seems intended to avoid
> duplicates when lists are crossposted. It shouldn't be necessary if
> we just want to filter out duplicates to explicitly named recipients.
> Maybe we don't need both features, as the former seems to be much less
> requested than the latter?
That's true. The later is nice for instance when you have threads Cced
accross mailman-devel and mailman-users, but having the former by itself
would be good already.
If this is a time issue wrt fixing the code, the duplicate message-ID code
could be left behind a global option that is disabled by default and a
comment saying that you should be ready to restart your qrunner weekly or
daily if you enable it
That said, adding a timestamp to them, and deleting everything that's more
than 1H old is a better solution.
> I think what I'll do for now is code up and test the original
> approach. I'm on irc now so please join me if you want to talk about
> it.
>
> now-if-i-can-just-get-OPN-to-stop-kicking-me-off!-ly y'rs,
<-- barry has quit (Read error: 110 (Connection timed out))
:-)
I was saying there:
about OPN, I don't know the details (and there were also politics), but the
sf.net guys moved their channels to slashnet.org. Seems to work better there
Marc
--
Microsoft is to operating systems & security ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key