[Mailman-Developers] Re: [Mailman-Users] Allowing users to join without specifying passwords

Sun, 17 Jun 2001 00:46:47 -0700

On Friday, June 15, 2001, at 01:19 PM, Barry A. Warsaw wrote:

>     CVR> points. but we need to quantify what those points are and
>     CVR> what the impact is, so we can decide just how to move forward
>     CVR> on this.

> I'd love to see any statistic you (or anybody) gathers on this
> subject.  It's definitely intriguing, but right now I don't have the
> time or systems to do this kind of data gathering.
>

Okay, here's a first cut at some data.

I'm going to assume the following:

1000 subscribers -- no digest subscribers to simplify this. Assume just 
individual messages.

The message size is 10K, including header.

The bandwidth needed to generate a connection to send a message is 1K 
(which is pretty close)

The bandwidth needed to add an address to an existing message is about 
1/10 of a K (also pretty close).

The practical limit to the number of addresses you can piggyback onto 
one message is 100, since this is specified in RFC2821 as the smallest 
number a site is REQUIRED to take. In practice, due to non-conformant 
sites, you have to be careful setting it beyond 50 these days, because 
sites set this number down thinking it slows down the spammers (I have 
yet to be convinced it makes a damn bit of difference, especially since 
MTAs like postfix recognize the 452 and auto-adjust now. This is another 
place where sendmail seems behind the technology curve, FWIW).

How much bandwidth is used depends on these factors:

what your piggyback value is (in mailman, it's SMTP_MAX_RCPTS)

how many domains have > 1 subscriber.
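
To make the model concrete, here is a minimal sketch of the cost of delivering one message to a single domain under the assumptions above (10K message, 1K per connection, 0.1K per extra address). The function and constant names are made up for illustration; they are not Mailman code.

```python
import math

MESSAGE_K = 10.0     # message size including headers (assumption from above)
CONNECT_K = 1.0      # overhead to open a connection and send one message
EXTRA_RCPT_K = 0.1   # overhead to add one more address to an existing message


def domain_cost_k(subscribers, max_rcpts):
    """Bandwidth in K to reach every subscriber at one domain."""
    # Each batch of up to max_rcpts addresses needs its own connection.
    connections = math.ceil(subscribers / max_rcpts)
    # Every address beyond the first in a batch rides along for 0.1K.
    piggybacked = subscribers - connections
    return connections * (MESSAGE_K + CONNECT_K) + piggybacked * EXTRA_RCPT_K


if __name__ == "__main__":
    for subscribers, max_rcpts in [(1, 100), (3, 100), (3, 1), (142, 100)]:
        cost = domain_cost_k(subscribers, max_rcpts)
        print(subscribers, max_rcpts, round(cost, 1))
```

The only design choice is that a domain with more subscribers than the limit pays the full 11K once per batch, which is what makes the piggyback value matter for the big domains.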

Here's how plaidworks breaks down:

3101 subscribers across 1287 domains. that's an average of 2.3 
subscribers per domain, but the numbers skew wildly, so averages are 
meaningless.

Here's how my site breaks down:

# of subscribers			# of domains/# of users
---------------------			-----------------
1						263/263
2						142/284
3						40/120
4						19/76
5						16/80
6						10/60
7						7/49
8						3/24
9						6/54
10						2/20
11						2/22
12						2/24
13						1
14						1
16						1
17						1 (worldnet.att.net)
22						1 (juno.com)
29						1 (mindspring.com)
30						1 (pacbell.net)
35						1 (plaidworks.com)
43						1 (sympatico.ca)
53						1 (earthlink.net)
150						1 (home.com)
173						1 (yahoo.com)
228						1 (hotmail.com)
441						1 (aol.com)

If you're scoring at home, about 32% of subscribers come from the last 4 
domains: 5% each for home and yahoo, 7% for hotmail, and 14% for aol. 
Those are your 500 pound gorillas (AOL is 800 pounds), and piss them off 
at your own risk.

At the other end, 8% of your users are the only subscriber from a 
domain. 16% are 1 or 2 per domain. 26% are on sites with 5 or fewer 
subscribers.
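
For anyone who wants to check the big-domain shares, a quick sketch using the counts from the table and the 3101 total. The helper name is made up for illustration.

```python
TOTAL_SUBSCRIBERS = 3101  # total quoted above

# Subscriber counts for the four largest domains, copied from the table.
BIG_DOMAINS = {
    "home.com": 150,
    "yahoo.com": 173,
    "hotmail.com": 228,
    "aol.com": 441,
}


def share_percent(count, total=TOTAL_SUBSCRIBERS):
    """Return count as a percentage of the whole subscriber base."""
    return 100.0 * count / total


if __name__ == "__main__":
    for name, count in BIG_DOMAINS.items():
        print(f"{name:12s} {count:4d} {share_percent(count):5.1f}%")
    combined = sum(BIG_DOMAINS.values())
    print(f"{'combined':12s} {combined:4d} {share_percent(combined):5.1f}%")
```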

Time for some numbers.

Back to the 1000 member list for simplicity. The subscriber list breaks 
down to:

85	- 	1/85
45	-	2/90
12	-	3/36
6	-	4/24
[...]
48	-	1
55	-	1
73	-	1
142	-	1

That's 553, or 55% of the subscribers, wedged tightly on both ends of 
the curve. We can extrapolate what they'll do to bandwidth from the end 
cases if we need to.

Extreme case: SMTP_MAX_RCPTS = 1.

1000 subscribers * (10K message size + 1K overhead) = 11,000K bytes 
bandwidth.

Extreme case: SMTP_MAX_RCPTS = 100

These get sent  down the line this way:

85 * 11K
45 * (1 * 11K + 1 * .1K)
12 * (1 * 11K + 2 * .1K)
6 * (1 * 11K + 3 * .1K)
[...]
1 * 11K + 47 * .1K
1 * 11K + 54 * .1K
1 * 11K + 72 * .1K
2 * 11K + 140 * .1K

Do you see how I got these numbers? In the case of the 12 domains with 
three subscribers, you have to make an 11K connection for the first 
message, and piggyback the other two addresses at .1K each. You don't 
really see huge savings until the big domains, and you'll see AOL goes 
over the 100 address limit, so it gets split into two different 
messages.

For this 55%, SMTP_MAX_RCPTS = 1 costs 6083K. At 100, it's 1723K bytes. 
That's 28% of the first number, so we're cutting 72% of the bandwidth by 
chunking at 100. The tradeoff is performance, though -- it takes a lot 
longer to deliver those AOL addresses, because with only two batches 
there is very little to parallelize. Package up 100 AOL addresses in one 
batch, and none of them get delivered until all 100 addresses are sent 
to AOL and accepted. It's much faster to send them as ten batches of ten 
in parallel -- but that's the tradeoff here: cut network bandwidth but 
slow delivery to the larger domains.
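
Here is a rough sketch of that batching tradeoff: the same 142 addresses chunked at 100 and at 10. The addresses are made up, and `batches` is a hypothetical helper, not how Mailman itself chunks recipients.

```python
def batches(addresses, max_rcpts):
    """Split a recipient list into chunks of at most max_rcpts addresses."""
    return [addresses[i:i + max_rcpts]
            for i in range(0, len(addresses), max_rcpts)]


if __name__ == "__main__":
    # Made-up recipients standing in for one large domain.
    recipients = [f"user{i}@example.com" for i in range(142)]
    for limit in (100, 10):
        sizes = [len(chunk) for chunk in batches(recipients, limit)]
        print(f"limit {limit}: {len(sizes)} batches, sizes {sizes}")
```

With a limit of 100 there are only two batches to hand to delivery threads; with a limit of 10 there are fifteen small ones that can go out in parallel, at the cost of fifteen connections instead of two.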

Okay, let's look at a case in the middle. SMTP_MAX = 5. The domains with 
fewer than 5 subscribers don't change, but the big domains do:

85 * 11K
45 * (1 * 11K + 1 * .1K)
12 * (1 * 11K + 2 * .1K)
6 * (1 * 11K + 3 * .1K)
[...]
1 * (10 * 11K + 38 * .1K)
1 * (11 * 11K + 44 * .1K)
1 * (15 * 11K + 58 * .1K)
1 * (29 * 11K + 113 * .1K)

that works out to (trust me) about 2378K, or about a 60% reduction.

Let's try SMTP_MAX = 2.

85 * 11K
45 * (1 * 11K + 1 * .1K)
12 * (2 * 11K + 1 * .1K)
6 * (2 * 11K + 2 * .1K)
[...]
1 * (24 * 11K + 24 * .1K)
1 * (28 * 11K + 27 * .1K)
1 * (37 * 11K + 36 * .1K)
1 * (71 * 11K + 71 * .1K)

that works out to about 3609K, or about a 41% cut.

By a rough look at those domains in the middle, I'd say these numbers 
are good +-10%.
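
The four cases above can be re-run mechanically. This is a sketch under the same assumptions (11K per connection, .1K per piggybacked address) using only the eight rows listed for the 1000-member list; the elided middle rows are left out, and the names are made up for illustration.

```python
import math

PER_CONNECTION_K = 11.0   # 10K message + 1K connection overhead
PER_EXTRA_RCPT_K = 0.1    # each address piggybacked on an existing message

# (number of domains, subscribers per domain) for the rows listed above.
ROWS = [(85, 1), (45, 2), (12, 3), (6, 4),
        (1, 48), (1, 55), (1, 73), (1, 142)]


def total_k(rows, max_rcpts):
    """Total bandwidth in K for one posting at a given recipient limit."""
    total = 0.0
    for domains, subscribers in rows:
        connections = math.ceil(subscribers / max_rcpts)
        piggybacked = subscribers - connections
        total += domains * (connections * PER_CONNECTION_K
                            + piggybacked * PER_EXTRA_RCPT_K)
    return total


if __name__ == "__main__":
    baseline = total_k(ROWS, 1)
    for limit in (1, 2, 5, 100):
        cost = total_k(ROWS, limit)
        saved = 100 * (1 - cost / baseline)
        print(f"SMTP_MAX_RCPTS={limit:3d}  {cost:7.1f}K  {saved:4.1f}% saved")
```

Treat the output as a cross-check on the hand arithmetic: if the two disagree, the per-row rule in `total_k` is the thing to inspect.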

What's this mean? Here's the executive summary:

The network penalty between SMTP_MAX = 1 (effectively VERP) and any kind 
of batching (SMTP_MAX > 1) is roughly 50%. To get VERP or customized 
footers or customized anything, you roughly double your network 
bandwidth.

There is very little advantage to setting SMTP_MAX > 5, UNLESS your 
subscriber base is heavily stratified onto very few sites. If you have 
really large groups of subscribers on AOL or Hotmail, it can help cut 
network bandwidth, but at best, it seems to be about a 10% improvement. 
If you plot the numbers I did on a curve, you can see just how little 
advantage you get by increasing the number. You get much of the 
advantage by going to 2, and the line past 5 is very flat....

Interesting -- I honestly didn't expect to see THIS big a difference -- 
I was expecting more like 25-30% increase in bandwidth for a VERP-type 
delivery.

My thoughts on what this means to future directions:

Customized messages (VERPing, or encoded unsub URLs, or all of that...) 
should definitely be an option in Mailman 2.1.

I would set Mailman 2.1's default to have this turned ON, giving us the 
customized unsub links and so on, and document it for users so they 
know to turn it off on slow networks.

If users turn it off, I recommend that SMTP_MAX be set by default to 5, 
and that we document that it makes little sense to change it unless a 
site is horribly network limited, because even setting to the max only 
gains them another 10% (and if they're THAT network limited, they're 
seriously asking for trouble anyway), and only if their subscriber base 
fits a profile that lends itself to the compression. Setting it large 
also leaves them open to spamblocking by systems that don't necessarily 
follow the standards or act right, too.

We should ALSO note here that some MTAs (postfix, for instance) might 
override SMTP_MAX anyway -- you could set it to 100, but postfix might 
be configured smaller, so they have to be aware of those potential 
interactions. you then get into the issues of tuning all this, with few 
delivery threads with lots of addresses vs many threads in parallel.. 
and all that fun -- I guess I'm trying to say that you can't tune 
mailman in isolation from the MTA (and down that road lies a huge 
rathole of attempting to document this stuff...)

But from these numbers, any 2.0.x version of mailman should set SMTP_MAX 
to between 2 and 5, unless they're horribly network limited. it makes no 
sense to be larger than 5, and it makes no sense to be 1 unless you've 
done some kind of VERPing patch.
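
For a 2.0.x site, that recommendation would look something like the fragment below. This is a sketch that assumes the usual arrangement where site overrides live in mm_cfg.py; the value 5 is just the upper end of the range suggested above, not a tested default.

```python
# Site override, assumed to live in mm_cfg.py alongside the other
# local settings. Caps the number of recipients handed to the MTA
# in a single SMTP transaction.
SMTP_MAX_RCPTS = 5
```

The MTA may apply its own per-message recipient limit on top of this (in postfix, default_destination_recipient_limit is the one worth checking), so the effective batch size is the smaller of the two.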

for 2.1, we want to implement these customizations and default them on, 
but with a 50% network hit, we definitely want to make it clear what's 
going on and make it possible for them to turn it off and return to a 
generic URL and non-customized e-mail.

Barry's mileage may vary on his preferences for default, of course, and 
it's his show. I think the customized URL/email capability is a huge 
advantage and most sites will benefit from it -- but the network hit 
might kill some sites, so we have to give them an easy way to turn the 
feature off.

What do y'all think? I've included mailman-developers on this reply, 
since while this started on mm-users, it really ought to be discussed on 
the developers list...

--
Chuq Von Rospach, Internet Gnome <http://www.chuqui.com>
[<chuqui@plaidworks.com> = <me@chuqui.com> = <chuq@apple.com>]
Yes, yes, I've finally finished my home page. Lucky you.

Yes, I am an agent of Satan, but my duties
are largely ceremonial.