[Mailman-Developers] Huge lists
Nigel Metheringham
Nigel.Metheringham@VData.co.uk
Thu, 25 May 2000 10:32:41 +0100
[Personal Ccs deleted... list only this time]
[ It's nice to see you folks have been enjoying yourselves whilst I sleep.
However I now have the advantage, so will respond to the dozen or so
messages in the last batch :-) ]
chuqui@plaidworks.com said:
> throwing hardware at a problem isn't always possible. but the place
> where rolling your own internal MTA starts becoming useful is when
> the list is big enough that the disk I/O involving the MTA starts
> becoming the significant limiter. With sendmail 8.9.x, that's fairly
> easy to run into. With sendmail 8.10, it seems to be better, and the
> multiple queue stuff solves a multitude of problems involving huge
> directory structures.
Wietse had some figures on MTA performance analysis which he used as
part of the design process for Postfix. He concluded that disk I/O was
*the* limiting factor for an MTA - remember that to comply with the
RFCs you have to commit incoming data to stable storage before
acknowledging receipt (i.e. the positive reply to SMTP end of data) - in
all current mainstream MTAs that means that the queue file has to be
closed and synced. Pushing data down to the rust and ensuring it's
stably there limits throughput drastically. Wietse's tests should be on
www.postfix.org
> VERP exacerbates the problem, since # of batches sent to the MTA
> equals the # of addresses, which explodes the number of control
> files, which... So at some point, it makes sense to deliver direct to
> recipient rather than build batches into the MTA, and completely
> avoid the disk I/O and deliver right out of the database to the
> receiving SMTP client. You could strongly parallelize the delivery
> setup because you'd do away with all of the MTA overhead, and do all
> sorts of fun things, like prioritize your delivery sorting and the
> like.
If we have a million user list... and a message of a few K, I'm not
sure I want to have a few GB of queue space taken up. If some idiot
sends a 1M attachment I doubt many of us have the TB spool space.
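A rough back-of-envelope check of those figures, assuming the worst case of one queued copy per recipient and ignoring headers and per-file filesystem overhead. The function name and the 4 KB reading of "a few K" are illustrative assumptions only:

```python
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12


def spool_bytes(recipients: int, message_bytes: int) -> int:
    """Queue space needed if every recipient gets its own queued copy."""
    return recipients * message_bytes


# "a few K" is taken as 4 KB here purely for illustration.
small = spool_bytes(1_000_000, 4 * KB)
# A 1M attachment sent to the same million-user list.
large = spool_bytes(1_000_000, 1 * MB)

print(f"small message: {small / GB:.1f} GB of spool")
print(f"1M attachment: {large / TB:.1f} TB of spool")
```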
Having said that I *really* would like the possibility of the
occasional message being sent out using VERP (maybe even just the
password reminders... although I'd prefer a method where some messages
are sent that way whenever the list has recently seen bounces that it
cannot tie to a particular subscriber). However then we also need to
recode the MTA incoming handling to take that - aliases don't cut it
any more.
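For anyone who hasn't met VERP: the subscriber's address is folded into the envelope sender, so a bounce identifies the subscriber by the address it comes back to. A minimal sketch follows. The `mylist-bounces` local part and the `+` / `=` delimiters are an assumed convention for illustration, not something mailman does today, and both function names are hypothetical. It also shows why plain aliases stop being enough: the incoming side has to accept an open-ended family of addresses.

```python
from typing import Optional


def verp_sender(bounce_local: str, list_domain: str, recipient: str) -> str:
    """Build a per-recipient envelope sender (assumed '+'/'=' encoding)."""
    rcpt_local, rcpt_domain = recipient.rsplit("@", 1)
    return f"{bounce_local}+{rcpt_local}={rcpt_domain}@{list_domain}"


def verp_recipient(envelope_to: str, bounce_local: str) -> Optional[str]:
    """Recover the subscriber from the address a bounce was delivered to."""
    local, _, _domain = envelope_to.rpartition("@")
    prefix = bounce_local + "+"
    if not local.startswith(prefix):
        return None  # not a VERP-encoded bounce address
    rcpt_local, sep, rcpt_domain = local[len(prefix):].rpartition("=")
    if not sep:
        return None  # malformed: no '=' separating local part and domain
    return f"{rcpt_local}@{rcpt_domain}"


sender = verp_sender("mylist-bounces", "lists.example.org", "jane@example.com")
print(sender)
print(verp_recipient(sender, "mylist-bounces"))
```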
------
The queueing stuff is interesting, although big list focused boxes are
likely to not be the primary users of mailman - however if the exim
list is anything to go by those (big list) users will be among the most
vocal and contribute most ideas and code. [I have worked on big mail
systems, but not really big list systems]
claw@kanga.nu said:
> Sorting the RCPT TO list by domain costs us very little (esp if we
> sort on insertion), and can help users of dumb MTAs considerably.
Yup...
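To illustrate how cheap "sort on insertion" is, here is a small sketch that keeps the recipient list ordered by domain as addresses are added. The `RecipientList` class is hypothetical and is not how mailman stores its membership:

```python
import bisect


class RecipientList:
    """Keeps addresses ordered by (lower-cased domain, address)."""

    def __init__(self):
        self._entries = []  # sorted list of (domain, address) tuples

    def add(self, address: str) -> None:
        domain = address.rsplit("@", 1)[-1].lower()
        # insort does a binary search for the slot, then inserts in place.
        bisect.insort(self._entries, (domain, address))

    def addresses(self):
        """Addresses in domain order, ready to hand to the MTA."""
        return [address for _domain, address in self._entries]


rcpts = RecipientList()
for addr in ["bob@zeta.example", "amy@alpha.example",
             "cat@zeta.example", "dan@Alpha.example"]:
    rcpts.add(addr)
print(rcpts.addresses())
```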
chuqui@plaidworks.com said:
> You could make a good argument that the best way to optimize is to
> create one mail batch per unique hostname, up to SMTP-MAX-RCPTS, at
> which point you split it into num_addrs/SMTP-MAX-RCPTS batches for
> that hostname, and then let the MTA sort it out from there.
Counter examples are always problems.... The biggest UK ISP group
(several "virtual" ISPs use the same bulk ISP service set) has a few
million users each of whom has their own domain name - so you will
find that *.freeserve.co.uk (around 2 million domains) all goes to the
same batch of MXes. This means that a good approach (for this type of
account naming) would be to pack in sets of addresses in reverse domain
order until you had a batch of SMTP-MAX-RCPTS (obviously you
additionally optimise this by also making sure that a single domain is
not split over 2 batches unless the number of addresses in that domain
is larger than a batch).
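A sketch of that packing scheme, under the assumption that reverse domain order is a good enough stand-in for "shares the same MXes" (true for the freeserve style of naming, not in general). The function names and the `max_rcpts` parameter are made up for the example; `max_rcpts` plays the role of SMTP-MAX-RCPTS:

```python
from itertools import groupby


def reversed_domain(address: str):
    """'a@foo.freeserve.co.uk' -> ('uk', 'co', 'freeserve', 'foo')."""
    domain = address.rsplit("@", 1)[-1].lower()
    return tuple(reversed(domain.split(".")))


def pack_batches(addresses, max_rcpts: int):
    """Pack addresses in reverse domain order into batches of max_rcpts.

    A domain is only split across batches when it is itself bigger
    than one batch.
    """
    batches, current = [], []
    ordered = sorted(addresses, key=lambda a: (reversed_domain(a), a))
    for _key, group in groupby(ordered, key=reversed_domain):
        group = list(group)
        if len(group) > max_rcpts:
            # Oversized domain: flush what we have, then chunk the domain.
            if current:
                batches.append(current)
                current = []
            for i in range(0, len(group), max_rcpts):
                batches.append(group[i:i + max_rcpts])
            continue
        if len(current) + len(group) > max_rcpts:
            # This domain would not fit whole, so start a new batch.
            batches.append(current)
            current = []
        current.extend(group)
    if current:
        batches.append(current)
    return batches


example = ["a@one.freeserve.co.uk", "b@one.freeserve.co.uk",
           "c@two.freeserve.co.uk", "d@two.freeserve.co.uk",
           "e@example.com"]
for batch in pack_batches(example, max_rcpts=3):
    print(batch)
```

Batches are not always filled to the limit here: keeping a domain whole takes priority over using every slot.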
As for a quick description of exim queueing practices:-
- Queues are processed in a basically random order... incoming
messages however *normally* have a delivery process invoked for
them immediately after end-of-smtp-data (there is policy
associated here - can be tweaked)
- Each domain/address/message has retry hints associated with it;
if the retry time for a message/domain/address has not been hit
then it is not taken further - so often a group of messages in
the queue is skipped on each queue run because their retry
time has not arrived
- Exim resolves all undelivered addresses in a message
and groups them by MX (let's ignore alternative delivery schemes
here)
- Each MX set has delivery attempted (there may be parallelism here)
- If the MX set can be contacted then the message is shoved down the
pipe, then the hints database is checked for other messages
outstanding on that MX set - if so then the pipe is passed to
another delivery process invoked on one of the waiting messages
- If MX set was *not* successful then the hints are updated to say
this message has addresses outstanding on that MX
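The grouping and retry steps in the list above can be sketched roughly as follows. This is emphatically not Exim's code or its data structures: the `plan_delivery` function, the `mx_for_domain` lookup table (standing in for real DNS resolution) and the per-MX-set `retry_hints` dict are all simplifying assumptions made for illustration.

```python
from collections import defaultdict


def plan_delivery(recipients, mx_for_domain, retry_hints, now):
    """Group a message's undelivered recipients by MX set.

    Returns (ready, deferred): MX sets to try now, and MX sets whose
    retry time (from the hints) has not yet arrived.
    """
    by_mx = defaultdict(list)
    for address in recipients:
        domain = address.rsplit("@", 1)[-1].lower()
        # Domains sharing the same set of MX hosts end up grouped together.
        by_mx[frozenset(mx_for_domain[domain])].append(address)

    ready, deferred = {}, {}
    for mx_set, addresses in by_mx.items():
        # No hint recorded means nothing is known against this MX set.
        if retry_hints.get(mx_set, 0) > now:
            deferred[mx_set] = addresses
        else:
            ready[mx_set] = addresses
    return ready, deferred


mx_table = {  # hypothetical pre-resolved MX data
    "foo.example": ["mx1.bulk.example", "mx2.bulk.example"],
    "bar.example": ["mx2.bulk.example", "mx1.bulk.example"],
    "slow.example": ["mx.slow.example"],
}
hints = {frozenset(["mx.slow.example"]): 2000}  # retry not before t=2000

ready, deferred = plan_delivery(
    ["a@foo.example", "b@bar.example", "c@slow.example"],
    mx_table, hints, now=1000)
print(ready)
print(deferred)
```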
So in the normal case each delivery process delivers only to the
addresses in the message it's dealing with - each message is independent
so you may have several SMTP sessions to the same place for different
messages. If things clog up then hints help make things more efficient.
[these are hints - sometimes they are ignored, and trashing the hints db
is quite OK]. This all works pretty well in practice. If you want a
particular type of efficiency you can rearrange things - i.e. make all
messages resolve, but only deliver on queue runs, which means that
messages for the same destination host are nearly always batched down a
single SMTP session.
[On per-MTA documentation]
Let's start bullying^Wpersuading people to contribute some documentation
on this stuff or pointers to existing MTA documentation that addresses
this. The question of MTA configuration for medium size lists is
pretty common, so there must be tuning data around. I guess I could
collate if needed [sigh]
Big lists are a different issue - you need to *choose* your MTA and
hardware within your constraints for that. Tuning is probably a
consultancy job for those.
chuqui@plaidworks.com said:
> There are exchange sites out there whose idea of a bounce message is
> to return the mail to the "to:" line with only the Message-ID
> changed. you can imagine how much fun THAT is.
More special bounce filters needed :-)
I *like* the way that mailman is now dealing with an impressive
proportion of bounces. I need to write an extra filter to make it drop
delay warning messages; other than that there's very little stuff
getting through to me in the way of bounces.
That particular one you mention should be blocked from the net -
presumably their upstream is clueless too.
Nigel.
--
[ - Opinions expressed are personal and may not be shared by VData - ]
[ Nigel Metheringham Nigel.Metheringham@VData.co.uk ]
[ Phone: +44 1423 850000 Fax +44 1423 858866 ]