[Mailman-Developers] Regarding Handlers/SMTPDirect.py and "chunkify"

Tue May 13 00:57:29 CEST 2008

Stefan Förster wrote:
>
>On 12.05.2008 at 23:20, Mark Sapiro wrote:
>> I understand what you are saying, but I wonder what the real world
>> difference would be. As currently written, chunkify returns at most 4
>> partially filled chunks. Granted, 4 is significantly bigger than one,
>> but given that the MTA is VERPing the deliveries, it may ultimately
>> create an outgoing queue entry for each recipient anyway, so the extra
>> 3 on the inbound side doesn't seem that significant (and it might
>> increase parallelism in the MTA).
>
>First of all, I just noticed that the official code does indeed only
>create at most 4 partially filled buckets. That's the problem when you 
>have to jump in for someone else: My SMTPDirect.py contains 26 TLDs. 
>Two thoughts:
>
>1. Even with only four buckets, when we have a real world distribution
>amongst recipient addresses, this is four times the I/O needed. The
>ratio gets better as the number of list subscribers grows, but if
>there are fewer recipients than SMTP_MAX_RCPTS, it's exactly 1:4.

True.
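
To make the 1:4 ratio concrete, here is a minimal Python sketch of
TLD-bucketed chunking as described in this thread. It is not a copy of
SMTPDirect.py: the name grouped_chunkify, the TLD_BUCKETS table and the
sample addresses are assumptions made for illustration. Only the
"four buckets, each flushed separately" behaviour comes from the
discussion.

```python
# Hypothetical TLD-to-bucket table; anything not listed falls into bucket 0.
TLD_BUCKETS = {"com": 1, "net": 2, "org": 2, "edu": 3, "us": 3, "ca": 3}


def grouped_chunkify(recips, chunksize):
    """Bucket recipients by TLD, then cut each bucket into chunks.

    Each bucket is flushed on its own, so every non-empty bucket can
    leave one partially filled chunk behind.
    """
    buckets = {}
    for addr in recips:
        tld = addr.rsplit(".", 1)[-1].lower() if "." in addr else None
        buckets.setdefault(TLD_BUCKETS.get(tld, 0), []).append(addr)

    chunks = []
    for key in sorted(buckets):
        bucket = buckets[key]
        for i in range(0, len(bucket), chunksize):
            chunks.append(bucket[i:i + chunksize])
    return chunks


recips = ["a@example.com", "b@example.net", "c@example.edu", "d@example.de"]
chunks = grouped_chunkify(recips, 500)
print(len(chunks))  # 4 chunks (one SMTP transaction each) for 4 recipients
```

With a chunk size of 500, a flat split would have sent these four
recipients in a single chunk.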

>2. Why even split recipients the way it's done now at all? You have to 
>either add new buckets (add new TLDs) or have all recipients outside
>the hard coded TLDs be thrown into the same bucket. I could understand 
>it if you first created a list of TLDs involved and sorted by those - 
>though I don't know if it's a good idea if you run a really large list 
>and examine all recipients...

This predates my experience with Mailman. It is based on the statistics
provided by Chuq and outlined in the FAQ. It's true that these
statistics may only be applicable to lists with primarily US members,
and may be outdated in any case, but I can't provide any more
information on why it's done that way. Perhaps it's an idea that's
outlived its usefulness.
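
The alternative Stefan mentions, deriving the TLD list from the
recipients themselves instead of hard coding it, could look roughly
like the sketch below. The helper names tld_of and sort_by_tld are
hypothetical, and the sort touches every recipient, which is exactly
the cost he is unsure about for very large lists.

```python
from itertools import groupby


def tld_of(addr):
    """Return the lowercased text after the last dot, or '' if none."""
    return addr.rsplit(".", 1)[-1].lower() if "." in addr else ""


def sort_by_tld(recips):
    """Order recipients by whatever TLDs actually occur in the list."""
    return sorted(recips, key=tld_of)


recips = ["a@example.de", "b@example.com", "c@example.fr", "d@example.de"]
ordered = sort_by_tld(recips)
# groupby only merges adjacent items, so it needs the sorted list.
tlds = [tld for tld, _ in groupby(ordered, key=tld_of)]
print(ordered)
print(tlds)  # ['com', 'de', 'fr']
```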

>I didn't understand what you said about VERPing and outgoing queue
>entries - surely any MTA will keep track of recipients on a per
>message basis?

I wasn't thinking clearly. I'm sure you're correct.

>As for parallelism, I think the best way to ensure fast 
>delivery is to make all target destinations known to the MTA as fast
>as possible.
>
>> Given your 25000 member list, and assuming SMTP_MAX_RCPTS = 500, you
>> would have at most 54 chunks (and more likely 53 or 52) instead of 50.
>>
>> In any case, if I were coding this, I would be inclined to not make it
>> an option, but just to change chunkify so it still grouped, but
>> continued to fill the last chunk of a group from the next group so
>> there would be at most one partial chunk.
>
>At the moment, I changed the code to simply return SMTP_MAX_RCPTS per
>chunk - or all recipients if there are fewer than that. Hardcoded, not
>configurable. The way it is done now I can't see any real advantages - 
>especially living outside the U.S. Either improve the sorting 
>algorithm (all TLDs, don't return partial chunks) or make it
>configurable to skip sorting altogether. Or at least that's what I
>feel would be an improvement. Have it default to flat chunking. It
>saves CPU time, I/O operations and gives the MTA's queue manager more
>time to do its job.

I think you make a good argument. I'd like to hear from others on this.
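
For comparison, the two alternatives discussed above can be sketched in
a few lines of Python. This is only an illustration under stated
assumptions, not a patch: flat_chunkify and carryover_chunkify are
hypothetical names, the member addresses are generated, and the group
boundaries are made up.

```python
def flat_chunkify(recips, chunksize):
    """Cut the recipient list into consecutive chunks, ignoring domains."""
    recips = list(recips)
    return [recips[i:i + chunksize] for i in range(0, len(recips), chunksize)]


def carryover_chunkify(groups, chunksize):
    """Keep groups adjacent, but let a chunk continue into the next group.

    Only the very last chunk can end up partially filled.
    """
    flattened = [addr for group in groups for addr in group]
    return flat_chunkify(flattened, chunksize)


# 25000 generated members, chunk size 500 as in the example above.
members = ["user%d@example.org" % n for n in range(25000)]
print(len(flat_chunkify(members, 500)))  # 50

# Four made-up groups standing in for the TLD buckets.
groups = [members[:9000], members[9000:16500],
          members[16500:24900], members[24900:]]
print(len(carryover_chunkify(groups, 500)))  # 50
```

Both variants yield 50 chunks for the 25000 member case. The
carry-over version keeps the grouped ordering, so it only makes sense
if that ordering is still considered worth the sorting cost.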

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan