[Mailman-Users] Importing large archives ... design limit hit, and possible bug

Scott Courtney courtney at 4th.com
Sun Jun 2 03:35:38 CEST 2002


Hi, folks

I think I have found a bug in bin/arch, but I imagine someone has found it
before. Also, I have run into an architectural limit and would like to
change a constant to fix it, if possible.

*** Part One: Architectural Limit

I am trying to import large numbers of messages (several hundred, to as many
as 900 for one of my lists) into Mailman from mbox files. I can only do about
80 at a time with the "arch" program. I have written an awk script which breaks
my large mbox files into 80-message chunks, then I use a "for" loop from the
bash shell to process all of these.
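
For readers without awk handy, the splitting step could be sketched in Python
roughly as follows. This is not the awk script mentioned above. The function
names and the zero-padded file-name pattern are made up for illustration, and
the sketch assumes that every line starting with "From " opens a new message.

```python
def split_messages(lines):
    """Group mbox lines into messages; a 'From ' line opens a new message."""
    messages = []
    for line in lines:
        # Naive rule: any line starting with "From " is treated as a separator.
        if line.startswith("From ") or not messages:
            messages.append([])
        messages[-1].append(line)
    return messages


def chunk_messages(messages, size=80):
    """Split the list of messages into consecutive groups of at most `size`."""
    return [messages[i:i + size] for i in range(0, len(messages), size)]


def write_chunks(mbox_path, prefix, size=80):
    """Write <prefix>-split-NNN.mbox files and return their names in order."""
    # latin-1 maps every byte to a character, so no decoding errors occur;
    # newline="" keeps the original line endings untouched.
    with open(mbox_path, encoding="latin-1", newline="") as source:
        messages = split_messages(source)
    names = []
    for index, group in enumerate(chunk_messages(messages, size), start=1):
        name = "%s-split-%03d.mbox" % (prefix, index)
        with open(name, "w", encoding="latin-1", newline="") as out:
            for message in group:
                out.writelines(message)
        names.append(name)
    return names
```

The zero-padded counter keeps the chunks in chronological order when the shell
expands a glob such as <mylist>-split-*.mbox.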

So far the best result has come from starting with the earliest chunk, then
using "cat" to append the next chunk to that and re-running "arch", and so on
until all the chunks are added. The command sequence I'm using is:

for a in <mylist>-split-*.mbox; do
     cat "$a" >> archives/private/<mylist>.mbox/<mylist>.mbox
     bin/arch <mylist>
     cron/nightly_gzip <mylist>
done

This works, though it is cumbersome and slow.

I'm not a Python guru but do know C. I'm running on a machine with *lots* of
RAM. Can someone point me to which module I need to edit so that I can increase
the constant for an array size somewhere such that this thing will handle more
than 80 messages? I figure if I know where to look I can puzzle out the code
enough to make a simple change like that. But it will take days to find it in
code of an unfamiliar language with many modules.

The FAQ, by the way, does not address this problem, and it should. It only
instructs one to replace the archives/private/<mylist>.mbox/<mylist>.mbox
file, then run bin/arch. That only works for small numbers of messages.

I'll make this proposal: Someone point me in the right direction to solve
this thing. When I've got it working, I'll write an entry to submit for the
FAQ to document this for the next person, as a way of contributing to the
Mailman user community.

*** Part Two: Possible Bug

Sometimes bin/arch reaches a certain point and just... stops. No errors, but
no further messages are added. The "raw text" version of the archives grows
in size, though, as the same messages get added again and again after this
sticking point (before that, everything is fine).

I did some investigation and have found that the problem occurs when a normal
text line in the body of a message happens to begin with the string "From ".
It appears bin/arch is not being very smart about recognizing the beginning
of a new message. I think what is needed is to add a more detailed parsing
regexp to the code that determines where one message ends and another begins.
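
As a rough illustration of what a stricter test might look like, here is a
Python sketch. It is not taken from the Mailman source. The pattern assumes a
separator of the form "From <sender> <weekday> <month> <day> <time> <year>",
which is a common mbox layout but not the only one (some files add a time
zone, for example), and the name is_separator is hypothetical.

```python
import re

# Assumed separator shape, e.g.:
#   From courtney@4th.com Sun Jun  2 03:35:38 2002
SEPARATOR = re.compile(
    r"^From \S+\s+"
    r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+"
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+"
    r"\d{1,2}\s+\d\d:\d\d:\d\d\s+\d{4}\s*$"
)


def is_separator(line):
    """Return True only if the line looks like a full mbox envelope line."""
    return SEPARATOR.match(line) is not None
```

With this check, a body line such as "From what I can tell, it works." is no
longer mistaken for the start of a new message.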

I've worked around this on my own system by using some manual greps to find
the problem strings, then fixing them by using vi to insert a blank ahead of
the word "From" on these lines. It's cumbersome, but so far my success rate
is 100%.
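
The grep-and-vi workaround could be automated along these lines. This is a
minimal sketch under a stated assumption: a real separator line ends in
"HH:MM:SS YYYY", and any other line starting with "From " is body text. Both
function names are hypothetical, and the heuristic may need adjusting for
mbox files with a different envelope format.

```python
import re

# Assumption: genuine envelope lines end with a time and a four-digit year.
_ENVELOPE_TAIL = re.compile(r"\d\d:\d\d:\d\d \d{4}\s*$")


def looks_like_separator(line):
    """Heuristic test for a genuine mbox 'From ' envelope line."""
    return line.startswith("From ") and bool(_ENVELOPE_TAIL.search(line))


def escape_body_from_lines(lines):
    """Prefix a space to body lines that start with 'From '.

    Returns the fixed lines and the 1-based numbers of the lines changed.
    """
    fixed, changed = [], []
    for number, line in enumerate(lines, start=1):
        if line.startswith("From ") and not looks_like_separator(line):
            fixed.append(" " + line)
            changed.append(number)
        else:
            fixed.append(line)
    return fixed, changed
```

Reviewing the returned line numbers before overwriting the mbox file is a
sensible precaution, since a heuristic like this can misjudge unusual lines.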

Anyone have a better idea? As with the other problem, I'm willing to document
this for the FAQ if in fact it is a FAQ (pun intended).

Kind regards,

Scott

-- 
-----------------------+------------------------------------------------------
Scott Courtney         | "I don't mind Microsoft making money. I mind them
courtney at 4th.com       | having a bad operating system."    -- Linus Torvalds
http://www.4th.com/    | ("The Rebel Code," NY Times, 21 February 1999)





