[Mailman-Users] Importing large archives ... design limit hit, and possible bug

Scott Courtney courtney at 4th.com
Sun Jun 2 05:21:45 CEST 2002


On Saturday 01 June 2002 10:59 pm, LuKreme wrote:
> Out of curiosity, how did you split the mbox?  I have about 1200 emails I
> want to add to the archive.

I wrote a little "awk" program to split them into 80-message chunks. Here is
the source code:

************ BEGIN LISTING ***********
#!/usr/bin/awk -f
#
# Splits an mbox formatted file into chunks of size SZ messages each.
# Sends output to "split-0001.mbox", "split-0002.mbox", and so on, for as
# many as it takes. Overwrites the output files without warning.
#
# Author: Scott Courtney (courtney at 4th.com)
# License: GNU General Public License  http://www.gnu.org/
#
# WARNING: Written for my own one-time use; not thoroughly tested.
#

BEGIN {
        # Change this to determine how many messages go into
        # each chunk. I determined the default value (80) empirically.
        SZ=80;

        ZEROES="00000";
        n=SZ;
        fnum=0;
}

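# A new message starts at each mbox "From " separator line
# (an address containing "@" followed by an hh:mm:ss timestamp).
/^From .*@.*:..:.. / {
        n++;
        if (n>SZ) {
                # Current chunk is full: close it and start the next file.
                fnum++;
                n=1;
                if (fnum>1) {
                        close(fname);
                }
                # Zero-pad the chunk number to four digits.
                fn1=ZEROES fnum;
                fn2=substr(fn1,length(fn1)-3);
                fname="split-" fn2 ".mbox";
                print fname;
        }
}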

{
        print $0 > fname;
}
************************ END LISTING *************

Note: Per my other message on this topic, this program will *NOT* fix the
problem of messages whose body text contains lines matching the regexp
'^From '. You need to fix those *before* running this script.
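
For illustration only, here is a minimal sketch of one way that pre-pass
could look. It is not the method from the other message. It assumes that a
real separator line matches the same pattern the splitting script uses, and
that any other line beginning with "From " is body text, which it quotes
with a leading ">" (the conventional mbox escaping). The function name
escape_body_from is hypothetical. Body lines that happen to look like a full
separator line are not caught, and separator lines without an "@" would be
quoted by mistake.

```sh
# Hypothetical helper: quote body lines that start with "From ".
# Reads the named files (or standard input) and writes to standard output.
escape_body_from() {
    awk '
        # Assumed to be a real message separator: pass through untouched.
        /^From .*@.*:..:.. / { print; next }
        # Any other line starting with "From " is assumed to be body text.
        /^From /             { print ">" $0; next }
        # Everything else is copied unchanged.
        { print }
    ' "$@"
}
```

It would be used as "escape_body_from old.mbox > fixed.mbox", with the
result then fed to the splitting script.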

I have now run this (successfully) on my largest archive, which contained
987 messages.
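
One way to sanity-check a run is to count separator lines before and after
splitting. The sketch below assumes the same separator pattern as the
script; the function name count_messages is hypothetical.

```sh
# Hypothetical helper: count mbox separator lines in the named files
# (or on standard input), using the same pattern as the splitting script.
count_messages() {
    awk '/^From .*@.*:..:.. / { n++ } END { print n + 0 }' "$@"
}
```

Running "count_messages big.mbox" and then "count_messages split-*.mbox"
should print the same number if no message was lost.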

Some versions of awk (the default one on Solaris 8 is one of these, as far
as I know) have trouble handling more than a small number of output files,
even though the program closes each file when it's done writing to it. On
those systems you need to install GNU awk (gawk) for this script to work
correctly.

If anyone adds additional features to this, I'd appreciate a copy under GPL
so that your features get into my source file. Feel free to credit yourself
in the comments if you enhance this thing. Usual GPL etiquette. :-)

Scott

-- 
-----------------------+------------------------------------------------------
Scott Courtney         | "I don't mind Microsoft making money. I mind them
courtney at 4th.com       | having a bad operating system."    -- Linus Torvalds
http://www.4th.com/    | ("The Rebel Code," NY Times, 21 February 1999)





