[Mailman-Developers] chunkify suggestion, with patch.

Darrell Fuhriman darrell@grumblesmurf.net
Mon, 17 Jul 2000 03:39:30 -0700 (PDT)


I've sent a modified version of chunkify() that behaves, I think, a bit
more intelligently.

I've modified it so that it sorts the recipient list by domain, then
breaks that list into chunks of SMTP_MAX_RCPTS.  I've also changed the
default SMTP_MAX_RCPTS to 20 for the reasons given below.

The advantage of doing it this way is that smaller recipient lists are
handed off to the MTA, and consequently list members get better
turnaround time.  Because the list is sorted, any domains that are
unreachable are grouped together, so only the delivery process handling
that chunk is penalized.
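
As a rough illustration of the grouping effect described above, here is a
minimal sketch in present-day Python.  It is not the patch itself: the
name chunk_sorted is hypothetical, and it sorts on the whole domain string
rather than label by label as domain_sort does.

```python
def chunk_sorted(recips, chunksize):
    """Sort addresses by domain, then slice into fixed-size chunks.

    Hypothetical helper for illustration; assumes chunksize > 0.
    """
    # Everything after the last '@', lowercased, is used as the sort key.
    ordered = sorted(recips, key=lambda a: a.rpartition('@')[2].lower())
    return [ordered[i:i + chunksize]
            for i in range(0, len(ordered), chunksize)]


recips = ['u1@dead.example', 'a@ok.example',
          'u2@dead.example', 'b@ok.example']
chunks = chunk_sorted(recips, 2)
```

With a chunk size of 2, both dead.example recipients land in the same
chunk, so a slow or unreachable host holds up only that one delivery.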

Ideally, all messages intended for a particular domain would end up in the
same delivery process, but my code doesn't do that right now.  Doing so
would probably entail adding a second tunable variable, say SMTP_MAX_QUEUE,
to serve as the chunk size, and returning SMTP_MAX_RCPTS to its original
purpose of not exceeding an MTA limit.  Then if you had many recipients
in a single domain, you could group them all together in one chunk that
exceeds SMTP_MAX_QUEUE, but is no larger than SMTP_MAX_RCPTS.
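
One way the two-threshold idea might look, sketched in present-day Python.
The function name chunkify_by_domain and the parameter names max_queue and
max_rcpts are made up for illustration (they stand in for SMTP_MAX_QUEUE
and SMTP_MAX_RCPTS).  The sketch assumes 0 < max_queue <= max_rcpts and
does not handle the "0 means no limit" setting.

```python
from itertools import groupby


def _domain(addr):
    # Everything after the last '@', lowercased.
    return addr.rpartition('@')[2].lower()


def chunkify_by_domain(recips, max_queue, max_rcpts):
    """Keep each domain's recipients together where possible.

    max_queue is the preferred chunk size; max_rcpts is the hard
    ceiling.  Assumes 0 < max_queue <= max_rcpts.
    """
    chunks = []
    current = []
    ordered = sorted(recips, key=_domain)
    for _, group in groupby(ordered, key=_domain):
        group = list(group)
        # A domain bigger than the hard ceiling still has to be split.
        pieces = [group[i:i + max_rcpts]
                  for i in range(0, len(group), max_rcpts)]
        for piece in pieces:
            # Start a new chunk rather than overflow the preferred size,
            # unless the current chunk is empty (one domain may exceed
            # max_queue on its own).
            if current and len(current) + len(piece) > max_queue:
                chunks.append(current)
                current = []
            current.extend(piece)
    if current:
        chunks.append(current)
    return chunks


recips = ['a@x.com', 'b@x.com', 'c@x.com', 'd@y.org', 'e@z.net']
chunks = chunkify_by_domain(recips, max_queue=2, max_rcpts=4)
```

Here the three x.com addresses stay in one chunk even though that exceeds
max_queue, while the two single-recipient domains share a second chunk.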

But, like I said, that doesn't happen right now.

Some of you may recognize this as the idea behind bulk_mailer. :)

Also, one problem with this code is that the domain_sort function is
really rather slow.  If anyone has suggestions for speeding it up, that'd
be great.  :)
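
One likely cause of the slowness is that a comparison function is called
on the order of n log n times, and domain_sort re-splits both addresses on
every call.  A possible speed-up is to compute a sort key once per address
instead.  The sketch below uses present-day Python (sorted() with key=),
which is an assumption about the environment; domain_key and
sort_by_domain are hypothetical names.  Note one behavioural difference:
here foo.com sorts before a.foo.com, whereas domain_sort puts the longer
domain first.  Grouping by domain is unaffected.

```python
def domain_key(addr):
    """Sort key: reversed domain labels, then the local part.

    An address with no '@' is treated as a bare domain, so this
    does not raise on malformed input.
    """
    local, _, domain = addr.rpartition('@')
    labels = domain.lower().split('.')
    labels.reverse()
    return (labels, local)


def sort_by_domain(recips):
    # The key is computed once per address, not once per comparison.
    return sorted(recips, key=domain_key)


result = sort_by_domain(['b@foo.com', 'a@bar.net',
                         'c@a.foo.com', 'a@foo.com'])
```

On the interpreter the patch targets, the same idea could be done by hand
as decorate-sort-undecorate: build a list of (key, address) tuples, sort
it with no comparison function, then strip the keys back off.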

Darrell

Diffs against b4

--- Mailman/Handlers/SMTPDirect.py~	Fri Jun  2 21:59:45 2000
+++ Mailman/Handlers/SMTPDirect.py	Mon Jul 17 03:19:46 2000
@@ -106,48 +106,79 @@
 
 
 def chunkify(recips, chunksize):
-    # First do a simple sort on top level domain.  It probably doesn't buy us
-    # much to try to sort on MX record -- that's the MTA's job.  We're just
-    # trying to avoid getting a max recips error.  Split the chunks along
-    # these lines (as suggested originally by Chuq Von Rospach and slightly
-    # elaborated by BAW).
-    chunkmap = {'com': 1,
-                'net': 2,
-                'org': 2,
-                'edu': 3,
-                'us' : 3,
-                'ca' : 3,
-                }
-    buckets = {}
-    for r in recips:
-        tld = None
-        i = string.rfind(r, '.')
-        if i >= 0:
-            tld = r[i+1:]
-        bin = chunkmap.get(tld, 0)
-        bucket = buckets.get(bin, [])
-        bucket.append(r)
-        buckets[bin] = bucket
+    # If we turn down the chunksize (i.e. SMTP_MAX_RCPTS), and have
+    # the addresses sorted by domain, it's much nicer to the MTA and
+    # to the users.  (In the majordomo world, this is what bulk_mailer
+    # would do.)
+    # In an ideal world, a single domain wouldn't be split across
+    # multiple chunks unless some other threshold had been met.
+    # I'll save that for sometime when it's not 2:30am.  :)
+
+    recips.sort(domain_sort)
+    
     # Now start filling the chunks
     chunks = []
     currentchunk = []
-    chunklen = 0
-    for bin in buckets.values():
-        for r in bin:
-            currentchunk.append(r)
-            chunklen = chunklen + 1
-            if chunklen >= chunksize:
-                chunks.append(currentchunk)
-                currentchunk = []
-                chunklen = 0
-        if currentchunk:
+    for recip in recips:
+        if len(currentchunk) >= chunksize:
             chunks.append(currentchunk)
             currentchunk = []
-            chunklen = 0
+        currentchunk.append(recip)
+    if len(currentchunk) != 0:
+        chunks.append(currentchunk)
     return chunks
+            
+
+def domain_sort(x, y):
+    x_longer = 0
+    y_longer = 0
 
+    # split the user from the rest (we may need the username later)
+    x_tmp = string.split(x, '@')
+    y_tmp = string.split(y, '@')
+    
+    x_list = string.split(x_tmp[1], '.')
+    y_list = string.split(y_tmp[1], '.')
+    x_user = x_tmp[0]
+    y_user = y_tmp[0]
 
+    # now reverse it, to make the code cleaner
+    x_list.reverse()
+    y_list.reverse()
+
+    # find out which domain is shorter
+    # or if they're the same length
+    if len(x_list) == len(y_list):
+        for i in range(0, len(x_list)):
+            ret = cmp(x_list[i], y_list[i])
+            if ret != 0:
+                return ret
+        # if they're the same length, and we get to this
+        # point, it's because they're identical domains.
+        # we'll just compare on the username, 'cause we can
+        return cmp(x_user, y_user)
+    elif len(x_list) < len(y_list):
+        length=len(x_list)
+        y_longer = 1
+    else:
+        length=len(y_list)
+        x_longer = 1
+        
+    for i in range(0, length):
+        ret = cmp(x_list[i], y_list[i])
+        if ret != 0:
+            return ret
+    # if we get to this point, we've got two domains of the
+    # form: a.foo.com and foo.com and we decide the longer
+    # one goes first
+    if x_longer == 1:
+        return -1
+    if y_longer == 1:
+        return 1
+    # we should never get here
+    return 0
 
+
 def pre_deliver(envsender, msgtext, failures, chunkq):
     while 1:
         # Get the next recipient chunk, if there is one
--- Mailman/Defaults.py~	Mon Jul 17 03:23:09 2000
+++ Mailman/Defaults.py	Mon Jul 17 03:16:47 2000
@@ -56,7 +56,7 @@
 PUBLIC_ARCHIVE_URL  = '/pipermail'
 PRIVATE_ARCHIVE_URL = '/mailman/private'
 
-COME_PAGE         = 'index.html'
+HOME_PAGE         = 'index.html'
 MAILMAN_OWNER     = 'mailman-owner@%s' % DEFAULT_HOST_NAME
 
 
@@ -128,7 +128,7 @@
 # Ceiling on the number of recipients that can be specified in a single SMTP
 # transaction.  Set to 0 to submit the entire recipient list in one
 # transaction.  Only used with the SMTPDirect DELIVERY_MODULE.
-SMTP_MAX_RCPTS = 500
+SMTP_MAX_RCPTS = 20
 
 # Maximum number of simulatenous subthreads that will be used for SMTP
 # delivery.  After the recipients list is chunked according to SMTP_MAX_RCPTS,