[Mailman-Developers] Huge lists

bwarsaw@python.org bwarsaw@python.org
Fri, 2 Jun 2000 16:38:18 -0400 (EDT)


>>>>> "CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:

    CVR> The important thing is to work on optimizing dropping mail
    CVR> into the MTA, and then letting the MTA do its job (and tune
    CVR> it!). In my work, the biggest problems is the way MLMs drop
    CVR> mail off for delivery -- usually by doing nothing more than a
    CVR> single-threaded drop sorted by a reversed address. That puts
    CVR> domain names together (but really ought to be a sort by MX
    CVR> instead, it's quite sub-optimal, especially with big
    CVR> domain-hosting sites like iname.com), but still limits
    CVR> delivery to how fast the MTA will accept addresses, and
    CVR> that's limited by DNS, since sendmail resolves addresses as
    CVR> it accepts mail...

This is very interesting.  I've just added some code to SMTPDirect.py
to support multiple MTA drop-off threads.  I actually see worse
performance with this code, which either means I screwed up the
implementation or Something Else Is Going On.

I set up a 25k+ list and timed the drop-off.  With sequential,
single-threaded delivery, I could deliver a small message to Postfix
in about 41 seconds.  Chunking the recips to 500 gave me about 50
chunks, and if I give each chunk its own thread, I'm seeing no better
than 58 seconds for delivery.
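
For illustration only, here is a rough sketch of the one-thread-per-chunk
drop-off described above.  It is written in modern Python, not the
1.5.2-era code in SMTPDirect.py, and every name in it (chunk,
threaded_dropoff, deliver) is made up for the example.  The deliver
callable is assumed to hand one chunk to the MTA and return a dict of
refused addresses, which is the shape smtplib's SMTP.sendmail() returns.

```python
import threading


def chunk(recips, size):
    """Split the recipient list into consecutive slices of at most `size`."""
    return [recips[i:i + size] for i in range(0, len(recips), size)]


def threaded_dropoff(recips, deliver, chunksize=500):
    """Hand each chunk to `deliver` in its own thread.

    `deliver` is a hypothetical callable taking a list of addresses and
    returning a dict mapping refused addresses to a reason.  The merged
    dict of all refusals is returned once every thread has finished.
    """
    failures = {}
    lock = threading.Lock()

    def worker(rcpts):
        refused = deliver(rcpts)
        # Collate failures from all threads into one dict.
        with lock:
            failures.update(refused)

    threads = [threading.Thread(target=worker, args=(c,))
               for c in chunk(recips, chunksize)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

Keeping all the workers in one process is what makes collating the
refusals this simple: one shared dict and a lock.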

I have a suspicion about what else might be happening, but I'm not
sure.  Currently, Python has a fairly limited threading model.  There
is a global interpreter lock which only allows a single thread to be
running Python code at any time.  This works well if you're doing a
lot of I/O, but not so well for CPU-intensive calculations.
Eventually, Python will probably support "free threading" which will
allow multiple threads to run Python code at the same time.

Now to my eyes, the drop-off part is mostly I/O, shoving data across
the socket, so I dunno.  And maybe based on what Chuq says above, the
threading approach would work better for Sendmail than for Postfix.
So I think I will keep the code, but disable it by default and let you
guys play with it.  Please note that not all Python interpreters are
built with threading enabled.  It looks like the latest RH rpms are
built with threading, but if you've built Python 1.5.2 from source,
you'll have to explicitly configure --with-threads (I hope to change
that for Python 1.6).

An alternative would be to fork off separate processes, but that seems
too heavyweight, and makes collating the failures from the
subprocesses more difficult.

    CVR> I'd guess you can get the first 90% simply by setting up
    CVR> mailman to deliver using four threads: .com, .net,
    CVR> .edu/.us/.ca and everything else, and allowing people to
    CVR> configure extra threads if the capacity allows (and I'd do
    CVR> that by splitting each feed in half), and then coming up with
    CVR> guides on how to tune the MTA for fast delivery (I'm just
    CVR> starting to figure out sendmail 8.10, but it looks like a
    CVR> nice improvement over earlier releases).

So with the next check-in, the chunking algorithm is to create 4
buckets: .com, .net/.org, .edu/.us/.ca, everything-else.  Chunks in
these buckets are no bigger than SMTP_MAX_RCPTS and buckets are not
back-filled.
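
A minimal sketch of that bucketing, again in modern Python and not the
actual check-in.  The function name chunkify and the way the top-level
domain is pulled off the address are assumptions for the example; the
chunksize argument plays the role of SMTP_MAX_RCPTS.

```python
def chunkify(recips, chunksize):
    """Sort recipients into four TLD buckets, then slice each into chunks."""
    # Bucket index by top-level domain; anything unlisted goes to bucket 3.
    bucket_for_tld = {
        'com': 0,
        'net': 1, 'org': 1,
        'edu': 2, 'us': 2, 'ca': 2,
    }
    buckets = [[], [], [], []]
    for addr in recips:
        # Text after the last dot; an address with no dot lands in
        # the everything-else bucket.
        tld = addr.rsplit('.', 1)[-1].lower()
        buckets[bucket_for_tld.get(tld, 3)].append(addr)
    chunks = []
    for bucket in buckets:
        # Slice each bucket on its own.  A short final chunk is left short
        # rather than topped up from the next bucket (no back-filling).
        for i in range(0, len(bucket), chunksize):
            chunks.append(bucket[i:i + chunksize])
    return chunks
```

Not back-filling means a list can produce a few more chunks than
len(recips) / chunksize would suggest, but each chunk stays within one
bucket.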

Eventually we'll need to separate out the entire hand-off process from
the main process, but that's not currently feasible.  I think this is
the best I can do for 2.0.

-Barry