[Mailman-Users] "No subject" messages in archives

Mark Sapiro msapiro at value.net
Mon May 21 19:37:28 CEST 2007


Ivan Van Laningham wrote:
>
>Ah.  Now we're getting somewhere.  Here are some sample "From " lines:
>
>1)  From the current list.mbox (leading '> ' not part of actual line):
> > From Lizzelvin at aol.com Sun Mar 18 18:17:56 2007


This is a normal Unix From_


>2)  From the old mbox which I want to incorporate (leading '> ' inserted):
> > From "robyn m. fritz" <rfritz at nwlink.com>
>or
> > From Mochie at webtv.net (C Ryplansky)


These are non-standard separators


>
>And here is the _fromlinepattern:
>
>_fromlinepattern = r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+" \
>                    r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
>
>Now, I don't understand much of this pattern, but it looks to me as if 
>a) there's no provision for matching " or < or > characters; and
>b) some sort of date/time mark is required.


This pattern is used by cleanarch to distinguish a standard Unix From_
separator from other lines that just happen to begin with "From ". It
matches "From " followed by whitespace-delimited fields containing
any non-whitespace - email address
3 alphanumerics - day of week
3 alphanumerics - month
1 or 2 digits - day of month
1 or 2 digits, colon, 2 digits, optional colon and 2 digits - hh:mm(:ss)
optional any non-whitespace - time zone offset
4 digits - year

So yes, it looks for a single email address and a date in a specific
format. The email address can be bracketed - <user at example.com> and
doesn't really have to look like a valid email address, but it can't
contain whitespace, thus it can't have a 'real name' unless it has no
whitespace such as johnsmith<jsmith at example.com>.
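
As a rough illustration of the matching rules described above, the pattern
can be tried against a few lines in Python. This is only a sketch: the
pattern is copied from the quoted message, the sample lines and the helper
name is_unix_from are made up, and the addresses use "@" as they would in a
real mbox (the archive displays it as " at ").

```python
import re

# Pattern copied from the quoted message above.
FROMLINE = re.compile(
    r"From \s*[^\s]+\s+\w\w\w\s+\w\w\w\s+\d?\d\s+"
    r"\d?\d:\d\d(:\d\d)?(\s+[^\s]+)?\s+\d\d\d\d\s*$"
)

# Hypothetical sample lines; the names and addresses are invented.
samples = [
    "From user@example.com Sun Mar 18 18:17:56 2007",
    "From johnsmith<jsmith@example.com> Sun Mar 18 18:17:56 2007",
    'From "john smith" <jsmith@example.com>',
    "From jsmith@example.com (J Smith)",
]


def is_unix_from(line):
    """Return True if the line matches the From_ separator pattern."""
    return FROMLINE.match(line) is not None


for line in samples:
    print(is_unix_from(line), line)
```

The first two samples match. The last two do not, because the field after
the address is not a three-character day of week and no date follows.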

This is only used by cleanarch. Pipermail doesn't care about the
contents of the From_ separator. It assumes any line that begins with
"From " is a separator and ignores the rest of the line.
 

>All the "From " lines are terminated with a \n, and all are followed 
>immediately by what look like valid message header lines, so I don't 
>think those are problems.  There do appear to be 1006 unescaped "From " 
>lines in the old mbox:
>
>$ grep '^From ' guppies-out.mbox | wc
>    46295  163728 1800087
>$ grep '^From: ' guppies-out.mbox | wc
>    45289  159710 1803623


This seems to indicate a problem, but still doesn't account for 5000
spurious archive entries.
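
The figure of 1006 is the difference between the two grep counts, 46295 -
45289. The same comparison can be sketched in Python; the function name and
the tiny sample below are made up for illustration. The difference is only
an estimate, since a body line can also begin with "From: " and a message
can lack a From: header.

```python
def count_prefixes(lines):
    """Count lines starting with 'From ' and lines starting with 'From: '."""
    lines = list(lines)
    separators = sum(1 for line in lines if line.startswith("From "))
    headers = sum(1 for line in lines if line.startswith("From: "))
    return separators, headers


# Invented sample: two messages plus one unescaped "From " line in a body.
sample = [
    "From user@example.com Sun Mar 18 18:17:56 2007\n",
    "From: user@example.com\n",
    "\n",
    "From here on the body continues.\n",
    "\n",
    "From other@example.com Sun Mar 18 19:00:00 2007\n",
    "From: other@example.com\n",
    "\n",
    "Second body.\n",
]

separators, headers = count_prefixes(sample)
print(separators - headers, "'From ' lines beyond the 'From: ' header count")
```

For a real mbox, the lines could come from something like
open(path, errors="replace"), where path is whatever file is being checked.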


>So, if I process the old mbox and convert the "From " lines without 
>dates into "From " lines without " and <> and add a date/time stamp, and 
>THEN run cleanarch, cleanarch should escape only the 1006 non-matching 
>"From " lines, and I should end up with an mbox I can combine with 
>March, April and May of 2007 from the current list.  Is that a correct 
>assessment?


That is correct, but if you can process the old mbox and identify which
"From " lines without dates are actually message separators, then you
should be able to identify which ones are not message separators and
just escape those. I.e. create your own archive cleaner specific to
this situation.
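
A minimal sketch of such a situation-specific cleaner follows. The
heuristic is an assumption, not something cleanarch does: a "From " line is
treated as a separator only if it is the first line or follows a blank
line, and the next line looks like a message header. Every other "From "
line is escaped with a leading ">". The function name and sample data are
hypothetical, and the heuristic would need checking against the real mbox.

```python
import re

# Assumed shape of a header line: a field name, a colon, then whitespace.
HEADER = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:\s")


def escape_non_separators(lines):
    """Return a copy of lines with non-separator 'From ' lines escaped."""
    out = []
    for i, line in enumerate(lines):
        if line.startswith("From "):
            prev_blank = i == 0 or lines[i - 1].strip() == ""
            next_header = (
                i + 1 < len(lines) and HEADER.match(lines[i + 1]) is not None
            )
            if not (prev_blank and next_header):
                line = ">" + line  # escape: not judged to be a separator
        out.append(line)
    return out


# Invented sample using the non-standard separator styles from the old mbox.
sample = [
    'From "john smith" <jsmith@example.com>\n',
    "Subject: hello\n",
    "\n",
    "Body text.\n",
    "From here on it gets harder.\n",
    "\n",
    "From jsmith@example.com (J Smith)\n",
    "Subject: re: hello\n",
    "\n",
    "From what I can tell, it works.\n",
]

cleaned = escape_non_separators(sample)
print("".join(cleaned), end="")
```

In this sample the two separators are left alone and the two body lines
that begin with "From " are escaped.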

-- 
Mark Sapiro <msapiro at value.net>       The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


