[Mailman-Developers] Re: Security

J C Lawrence claw@kanga.nu
Wed, 25 Oct 2000 11:14:40 -0700


On Tue, 24 Oct 2000 09:38:56 +0200 (CEST) 
Dan Ohnesorg <dan@feld.cvut.cz> wrote:

> On Mon, 23 Oct 2000, J C Lawrence wrote:
>> I'm certain the bug is not in Apache as it also occurs on post
>> passing straight to the list without going thru moderation.  It
>> is possible it is in Exim, tho I'd be extremely surprised.  For
>> one I've reinstalled all binaries from known good sources, and
>> have MD5ed all Exim files against both known good sources and the
>> copies installed on other happily running machines.  It is also
>> unlikely that the bug is in the kernel as I've reproduced the
>> problem with kernels built on other (untouched) machines and then
>> installed on the offending machine, and on kernels built locally
>> from cryptographically verified source balls.

> I am running mailman from the beginning of mailman developing
> (John Viegas version, I'm not sure if the name is correct, but
> cheers John). 

As have I.

> It runs on SMP machine and I have never such a problem. 

Ditto.  I currently have various versions of Mailman running on
three SMP systems without problems.  The fact that this particular
(other) SMP system is having Mailman problems does not seem related
to SMP.

> I have seen problems with forking of sendmail into 1000 processes
> while delivering messages in big (500 members) lists after
> committing pending request. 

This is a common MTA configuration issue, most often seen with qmail
FWLIW.  MTA tuning, especially as mail volumes grow, is a bit of an
art.  There was some interesting discussion in this area between me
and Chug on this list a couple of months back that you might want to
look at.

> But my kernel was never confused from this like Yours. I would
> say, Your memory chips are wrong. User space program can never
> corrupt filesystem.

It is possible I have bad RAM.  It seems rather unlikely, however
(see below).

You are missing data from the beginning of the thread.

I had a disk die (it held various mail archives).  In dying it not
only took down the system, but also succeeded in trashing various
bits of other filesystems on other devices.  Among the files trashed
were the tripwire database and its backups.  This was not apparent
when I replaced the drive and restored all relevant files from
known-good/secure backups.

Given the new drive, the system remained unstable, crashing
frequently (uptime measured in single digit hours).  This is as
compared to a previous uptime of near 200 days.

I then replaced every binary on the system from original verified
packages, including the kernel (built a new kernel locally from
cryptographically signed and checked sources, using a new
hand-checked .config).

Crashes continued and seemed to be coincident with mail travelling
thru Mailman, either thru the web approval process, or direct
through to the exploder (no approval).

The MTA at this point appeared to be happy.  Several tens of
thousands of messages a day travel through that system, and were
successfully passing through the system between crashes (my
secondary MXes were dumping mail onto the system at a rate of well
over 2K messages per minute upon rebooting from an extended crash --
which the system took quite happily).  All crashes were observed to
be time coincident with Mailman mail activities.

Suspecting bad disk blocks and potentially other hidden filesystem
troubles I then replaced all filesystems (except / and /boot) on the
system with journalling filesystems (ReiserFS), doing a surface
check on all partitions before putting the new filesystems on.  I
again replaced every binary on the system from confirmed correct
packages, and built a new kernel on a known secure machine from
cryptographically signed and checked sources.  Additionally I double
checked by doing MD5Sum signature comparisons of key binaries on the
target system, with specific attention paid to the mail system,
against a known secure system.  They matched perfectly.  
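
The comparison described above can be sketched in Python roughly as
follows.  This is an illustration only, not the commands actually
run: the function names (md5_of, compare_trees) and the idea of
having a copy of the known-good tree mounted next to the target tree
are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(target_root, reference_root):
    """List reference files that are missing or differ under target_root.

    Both roots are hypothetical directories: reference_root holds the
    known-good copies, target_root the files under suspicion.
    """
    target_root = Path(target_root)
    reference_root = Path(reference_root)
    problems = []
    for ref_file in sorted(reference_root.rglob("*")):
        if not ref_file.is_file():
            continue
        relative = ref_file.relative_to(reference_root)
        candidate = target_root / relative
        if not candidate.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif md5_of(candidate) != md5_of(ref_file):
            problems.append((relative.as_posix(), "mismatch"))
    return problems
```

An empty list from compare_trees() corresponds to "they matched
perfectly"; files that exist only on the target are not reported by
this sketch.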

Finally I ran a semi-burn-in on the system: leaving it overnight
continuously building kernels AND using SCP to send those kernels to
and from a remote box (to hit the network stack) with MD5 checks on
each end AND sending an average of 25K mail messages per minute to
another system on the local net (100base-T connected).  The next
morning there was not a single error in any file, all SCP copies had
completed without error, all MD5 checks were passed, and neither
MTA listed any problems (the messages themselves were
bit-bucketed).
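
The copy-and-verify part of such a burn-in can be sketched as below.
This is a minimal illustration under stated assumptions, not the
script that was run: round_trip and its transfer argument are
hypothetical names, and the default transfer is a plain local copy
where the real test pushed files over SCP to a remote box.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip(source, transfer=shutil.copyfile, rounds=10):
    """Copy source repeatedly and return the round numbers that failed.

    transfer(source, destination) is a stand-in for whatever moves the
    file (a local copy here; an SCP wrapper in a real burn-in).  A
    round fails when the copy's digest differs from the original's.
    """
    expected = file_md5(source)
    failures = []
    with tempfile.TemporaryDirectory() as scratch:
        for number in range(rounds):
            destination = Path(scratch) / f"copy-{number}"
            transfer(source, destination)
            if file_md5(destination) != expected:
                failures.append(number)
            destination.unlink()
    return failures
```

An empty result means every copy arrived with the digest it left
with, which is the outcome reported above.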

I then rolled the box back into production.

Crashes continued.

They also continued after building a new kernel on the target
machine -- from similarly verified sources (needed a slight tweak).

I then replaced Mailman from known good sources.

Crashes continued.

I've now removed all bytecoded files in the Mailman installation.
Additionally I've hand unrolled and re-rolled the config.db for one
of the lists that appears to be creating troubles.  The unrolled DB
looked good.  Additionally, as I had in excess of 30K messages in my
MTA spool pending delivery to assorted unresponsive remote systems
and I suspected that a corrupted queue file might have been causing
problems with Exim (which does briefly run as a privileged user), I
hand moved all spool entries from the target system to another
known-stable/secure system.
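
Clearing the bytecoded files can be sketched as below; Python
normally recompiles them from the .py sources on the next import.
This is an illustration only: remove_bytecode and its dry_run
parameter are hypothetical names, and the installation path is left
to the caller rather than assumed.

```python
import os


def remove_bytecode(root, dry_run=True):
    """Find *.pyc and *.pyo files under root; delete them unless dry_run.

    Returns the sorted list of matching paths so the caller can review
    what was (or would be) removed.
    """
    found = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            if name.endswith((".pyc", ".pyo")):
                path = os.path.join(directory, name)
                found.append(path)
                if not dry_run:
                    os.remove(path)
    return sorted(found)
```

Running it first with dry_run=True shows which files would go
before anything is actually deleted.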

We'll see what happens now.

-- 
J C Lawrence                                 Home: claw@kanga.nu
---------(*)                               Other: coder@kanga.nu
http://www.kanga.nu/~claw/        Keys etc: finger claw@kanga.nu
--=| A man is as sane as he is dangerous to his environment |=--