[Mailman-Developers] Requirements for a new archiver

Peter C. Norton spacey-mailman at lenin.nu
Wed Oct 29 14:54:20 EST 2003


On Wed, Oct 29, 2003 at 07:45:53PM +0100, Brad Knowles wrote:
> At 1:28 PM -0500 2003/10/29, John A. Martin wrote:
> 
> > Hmm... Maildirs.
> 
> 	Not.
> 
> 	From 
> 	<http://www.washington.edu/imap/documentation/formats.txt.html>:

[deletia]

I don't know why a reasonable person would cite documentation
pertaining to UW-IMAP, a server that has been a standards, security
and performance bummer.

Why not cite http://www.courier-mta.org/mbox-vs-maildir/?

<quote>

Painting "just about" every filesystem in existence with the same
brush, and assuming that every filesystem works pretty much in the
same way, is very misleading. Many contemporary high performance
filesystem are designed explicitly for parallel access. For example,
consider the SGI XFS filesystem:

    The free space and inodes within each AG are managed independently
    and in parallel so multiple processes can allocate free space
    throughout the file system simultaneously.[2]

It took me about 6 months to write the first revision of the
maildir-based Courier-IMAP server. The absence of maildir support in
the UW-IMAP server is the reason I wrote it. Many people have found
that it needed less memory, and was faster than UW-IMAP. Many people
observed that upgrading to Courier-IMAP lowered their overall system
load, and increased performance. Large mail clusters with a
network-based fault tolerant, scalable, architecture frequently have
problem deploying mbox-based mailboxes, due to many documented
problems with file locking (file locking is required for mbox-based
mailboxes) with network-based filesystems.[3] As referenced in [3],
maildirs have no issues with NFS (the most common type of a
network-based filesystem) since maildirs do not use locking.

After looking around for some time, I did not find any independent
benchmarks that directly measured the relative performance of mboxes
and maildirs. Therefore I decided to run some actual benchmarks
myself. I defined the test conditions according to UW-IMAP server's
documentation. I created a test environment that stacked the deck in
favor of mboxes. This was done in accordance with the claimed
shortcomings of maildirs as stated in UW-IMAP server's documentation,
in order to accurately measure the magnitude of the claimed problems.
</quote>
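
To make the no-locking point concrete, here is a minimal sketch of
maildir-style delivery: the message is written under tmp/ and then
renamed into new/, so a reader never sees a half-written file and
nothing needs a lock. This is an illustration only, not Courier's
code. The function names and the unique-filename scheme below are my
own assumptions.

```python
import itertools
import os
import socket
import time

# Per-process counter so two deliveries in the same second get distinct names.
_counter = itertools.count()


def make_maildir(root):
    """Create the three standard maildir subdirectories under root."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)


def deliver(root, message_bytes):
    """Write one message into root/new without taking any lock.

    The name format (time, pid, counter, host) is an assumed scheme
    chosen for this sketch; real delivery agents use their own variants.
    """
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    unique = "%d.P%dQ%d.%s" % (time.time(), os.getpid(), next(_counter), host)
    tmp_path = os.path.join(root, "tmp", unique)
    new_path = os.path.join(root, "new", unique)

    # O_EXCL makes creation fail instead of overwriting an existing file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(message_bytes)
        f.flush()
        os.fsync(f.fileno())

    # rename() within one filesystem is atomic: the message appears in
    # new/ fully written or not at all.
    os.rename(tmp_path, new_path)
    return new_path
```

An mbox writer, by contrast, has to lock the single shared file before
appending, which is where the NFS trouble described above comes from.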

and at the end:

<quote>

The final conclusion is that -- except in some specific instances --
using maildirs will be just as fast -- and sometimes much faster --
than mbox files, while placing less of a load on the rest of the mail
system. The claims in the UW-IMAP server's documentation regarding
maildir performance can be supported only in certain, specific, very
narrowly-defined conditions. There is no simple answer on which mail
storage format is better. A lot depends on many variables that vary
widely in different situations. Besides the raw benchmarks shown
above, other factors include the mail server software being used, what
kind of storage is being used, and the available network
bandwidth. The final answer depends on all of the above.

</quote>

[flame-bait deleted]

>       A database (such as used by Exchange) is really a much better
>  approach if you want to move away from flat files.  mx and especially
>  Cyrus take a tentative step in that direction; mx failed mostly because
>  it didn't go anywhere near far enough.  Cyrus goes much further, and
>  scores remarkable benefits from doing so.
> 
>       However, a well-designed pure database without the overhead of
>  separate files would do even better.
 
It always confounds me that people will go for database voodoo and
deride filesystems when a filesystem is a highly specialised database
in and of itself.  Putting things that are in a filesystem into a
database offers the power and flexibility of querying, but certainly
should not be done for the sake of speed (assuming the
filesystem-based implementation meets whatever other requirements are
present).
 
> 	Of course, we all know about the database problems of Exchange, 
> and how Exchange admins have to frequently shut everything down and 
> clean their databases, how often they crash, how often they 
> completely trash all e-mail for all their users, etc....

Which is a good lesson about databases: because of their flexibility,
they cannot be QA'd for all of their uses in advance.  They get put
into production, lose data, and are subsequently fixed.  Filesystems,
which have a more narrowly-defined scope, tend to suffer from this
less.  That's why database logs that live on filesystems are used for
data recovery when a database eats itself.
 
> 	I submit that the reason for this is the combination of crappy 
> Microsoft-style programming and the fact that no database handles 
> BLOBs well.  Even top-notch programmers have real problems with these 
> kinds of implementations -- I am intimately familiar with the 
> database implementation methods used in the AOL mail system, and 
> suffice it to say that this is a really, really hairy nightmare that 
> you do *NOT* want.

Databases aren't meant to be storage for abstract binary data.
They're meant to be a searchable index of data of types they
understand.  

Assuming I had a clean slate to start a database project for a mail
store, personally I'd much rather prototype it in something like
PostgreSQL, where I could add data types to deal with email.  I could
then make header types, text types, MIME types, etc.  Then I could
test whether it was a good idea before implementing it for real.
 
> 	That said, storing meta-data in a real database and then using 
> external filesystem techniques for actually accessing the data, 
> should give you the best of both worlds -- the speed of access of the 
> database, and the reliability and well-understood access and backup 
> mechanisms of filesystems.

I think using a standard SQL database for doing mail operations is
asking for trouble.  Standard databases don't know how to parse
RFC 822/2822 headers.  That means you've either got to write a whole
lot of stored procedures in a clunky query language (or Java!?!?!) and
then maintain them, or you've got to do it all in the
IMAP/POP3/whatever server, which means a whole lot of yammering
traffic between the database and that server all the time, which ==
slow.
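
For illustration, here is roughly what the second option looks like: a
rough sketch in which the server process parses the headers itself and
only pushes the results into the database, while the message stays in
a file. Python's sqlite3 is just a stand-in for "a standard SQL
database" here, and the table layout and function names are
hypothetical.

```python
import sqlite3
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def open_index(db_path=":memory:"):
    """Open (or create) the metadata index. The schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS msgs ("
        " path TEXT PRIMARY KEY,"  # where the message file lives on disk
        " msgid TEXT, sender TEXT, subject TEXT, date_iso TEXT)"
    )
    return conn


def index_message(conn, path, raw):
    """Parse headers in the application and store only metadata."""
    msg = message_from_bytes(raw)

    # The database sees plain strings; all RFC 2822 parsing happens here.
    date_iso = None
    date_hdr = msg.get("Date")
    if date_hdr:
        try:
            date_iso = parsedate_to_datetime(date_hdr).isoformat()
        except (TypeError, ValueError):
            date_iso = None  # unparseable Date header: leave it unset

    conn.execute(
        "INSERT OR REPLACE INTO msgs VALUES (?, ?, ?, ?, ?)",
        (path, msg.get("Message-ID"), msg.get("From"),
         msg.get("Subject"), date_iso),
    )
    conn.commit()
```

Every message costs a round trip like this, and every query that needs
something the index didn't capture means going back to the file and
parsing again in the server.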

-Peter

-- 
The 5 year plan:
In five years we'll make up another plan.
Or just re-use this one.



