[Mailman-Users] Archive merge and search

Mark Sapiro mark at msapiro.net
Mon Nov 10 05:10:59 CET 2014


On 11/09/2014 06:37 PM, Hal wrote:
> 
> Investigating the MBOX files in a text editor I found the problematic
> ones to have headers starting with ">From " (without the quotes) which
> the working ones didn't, so I removed all those lines from a couple of
> MBOX files, imported into the Mailman archives and all looked fine!
> Obviously I can't check every single posting, so does my discovery and
> solution sound feasible?


Mailman's bin/arch is very liberal (actually too liberal, thus
cleanarch) in what it accepts as a "From " separator in an mbox.
It assumes that any line beginning with "From " is the start of a new
message. Messages should look like

From user@example.com Sat Aug 16 15:10:02 2014
Some-Header: ...
Next-Header: ...
...
Last-Header: ...

first body line
next body line
...
last body line

i.e. the first line of each message is of the form

From user date

This is followed by headers which have a name ending with a colon and
which may be folded into multiple lines as long as the continuation
lines begin with a space.

The headers are terminated by an empty line (on the wire, the sequence
<CR><LF><CR><LF>) and the rest up to the next "From " separator is the body.
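As a rough illustration of the layout described above, here is a minimal
Python sketch that splits mbox text the liberal way, treating every line
that begins with "From " as the start of a new message. It is not
bin/arch's actual code; the name split_mbox and the tuple layout are
made up for this example.

```python
def split_mbox(text):
    """Split mbox text into (from_line, headers, body) tuples.

    Deliberately liberal: ANY line beginning with "From " starts a
    new message, which is the behaviour that causes trouble when a
    body line happens to start that way.
    """
    messages = []
    current = None  # [from_line, headers, body, in_body]
    for line in text.splitlines():
        if line.startswith("From "):
            current = [line, [], [], False]
            messages.append(current)
        elif current is None:
            continue  # ignore anything before the first separator
        elif not current[3]:
            if line == "":
                current[3] = True  # empty line ends the headers
            elif line[0] in " \t" and current[1]:
                # folded header: continuation line starts with whitespace
                current[1][-1] += " " + line.strip()
            else:
                current[1].append(line)
        else:
            current[2].append(line)
    return [(m[0], m[1], m[2]) for m in messages]
```

Feeding it a message whose body contains a line such as "From here it
breaks" yields two messages instead of one, which is the kind of split
that shows up as a broken archive.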

Sometimes, in some mbox formats, message bodies have lines that begin
with "From ". This confuses bin/arch into thinking a new message starts
there. bin/cleanarch looks at lines that begin with "From " and if they
don't look like

From user date

or aren't followed by a header-like line, it prefixes the line with > so
it won't confuse bin/arch.
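The escaping rule just described can be sketched roughly as follows.
This is only an approximation written for illustration, not cleanarch's
real code: the two regular expressions and the name escape_from_lines
are assumptions, and the real script may accept or reject different
date formats.

```python
import re

# Assumed shape of "From user date": one address token followed by a
# ctime-style date such as "Sat Aug 16 15:10:02 2014".
FROM_RE = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}(:\d{2})?\s+\d{4}$"
)
# Assumed "header-like" test: printable ASCII name (no colon), then a colon.
HEADER_RE = re.compile(r"^[!-9;-~]+:")


def escape_from_lines(lines):
    """Return a copy of lines with doubtful "From " lines prefixed by ">"."""
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            # Keep the line only if it looks like a real separator AND
            # the following line looks like a header.
            if not (FROM_RE.match(line) and HEADER_RE.match(nxt)):
                line = ">" + line
        out.append(line)
    return out
```

Note that a strict pattern like FROM_RE is exactly what makes a true
separator with an unusual date format get escaped by mistake.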

If however your mbox has true "From " separators that don't look like

From user date

(perhaps because the date format is wrong or some other reason),
cleanarch will 'escape' them which would be wrong.

So cleanarch may have munged your mboxes or they may be weird for other
reasons. In any case, I think you need to look at the original mboxes,
maybe with something like "grep '^From '" (or maybe "egrep '^>?From '")
to verify that there is some kind of unescaped "From " line at the
beginning of each message and that there are no unescaped "From " lines
in message bodies and possibly fix the problems manually.
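If grep output alone is hard to judge, a small Python sketch along these
lines could list every escaped or unescaped "From " line with its line
number and a guess at what it is. The function name audit_from_lines,
the labels and the header test are all made up for this example, and
the classification is only a heuristic to guide manual checking.

```python
import re

FROM_RE = re.compile(r"^(>?)From ")      # same idea as egrep '^>?From '
HEADER_RE = re.compile(r"^[!-9;-~]+:")   # assumed "header-like" test


def audit_from_lines(lines):
    """Return (line_number, kind, line) for each "From "/">From " line."""
    report = []
    for i, line in enumerate(lines):
        m = FROM_RE.match(line)
        if not m:
            continue
        escaped = m.group(1) == ">"
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        header_next = bool(HEADER_RE.match(nxt))
        if escaped:
            # An escaped line followed by a header may be a real
            # separator that was escaped by mistake.
            kind = "escaped-separator?" if header_next else "escaped"
        else:
            # An unescaped line with no header after it is probably
            # a body line that should have been escaped.
            kind = "separator" if header_next else "stray"
        report.append((i + 1, kind, line))
    return report
```

To try it on a file, something like
audit_from_lines(open(path, errors="replace").read().splitlines())
should do, where path is whichever original mbox you want to inspect.
Anything reported as "stray" or "escaped-separator?" is worth a manual
look.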

Note that removing all lines starting with ">From " may be problematic.
It could remove a body line if that body line originally started with
"From " and was escaped. On the other hand, if such a line was in the
headers, it would cause premature termination of the headers, which
could be your issue, but I would wonder how such a line got there.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

