[Mailman-Developers] Re: Future of pipermail?

Wed, 22 Nov 2000 02:23:28 -0500

Chuq Von Rospach wrote:

> >Alone, a basic filesystem served webserver gives us:
> >
> >     - efficient access to archives
> >
> >     - basic per-site, per-list authentication
> >
> >     - [with little addition] unified access/passwords between lists, etc...
> >
> >     - almost *zero* overhead with *very little* implementation cost
>
> which are all as true with a database based system. Except I don't
> agree with you on the overhead and implementation costs on your end.
> Ask anyone who manages a decent-sized NNTP system -- filesystem
> centric storage systems don't scale to large data sets well at all.
> Databases do....

Yes-- all as true with a database system... but a database requires the addition
of something other than just a vanilla HTTP server to implement the "get the data"
part.   Everything on the list above, save for the "with little addition" item, can
be done with an out-of-the-box Apache.   Obviously, this does not include the
administrative part-- the piece that does per-list/per-site configuration requires
some additional work regardless of whether the backing store is a database, a
filesystem, or WebDAV.

Another advantage to a filesystem based archival arrangement is that it is
*extremely* easy to write a random shell script or two, run from crontab, that
prunes data, rebuilds indices, moves stuff around, reformats things, archives off to
archived archives, etc...   Yes-- of course-- all of these operations can be done
with a database backing store, as well, but it is significantly more difficult to
develop such tools and install them into the system.   Likely, this is somewhat of
a perception issue, but most administrators will not hesitate to toss together a
shell script to manage an archive of stuff, but will think twice before diddling a
database.
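
As a rough illustration of the kind of throwaway maintenance script meant here,
the sketch below prunes old files from an archive directory.   It is written in
Python rather than sh only for consistency with Mailman; the function name, the
directory layout, and the age threshold are all assumptions made up for the
example, not anything Mailman or pipermail defines.

```python
import os
import time


def prune_archive(root, max_age_days, now=None):
    """Delete files under `root` older than `max_age_days`.

    Returns the sorted list of paths that were removed.
    `root` and the age policy are hypothetical; adjust to the real layout.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Use the modification time as a stand-in for the message date.
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Run from cron, something like `prune_archive("/path/to/archive", 365)` would
be the whole job; the path here is a placeholder.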

Very large and very expensive databases scale very well-- MySQL does not.   I
agree that for truly huge, high traffic sites,  moving away from a pure filesystem
approach-- moving everything into, say, an ES10000 running one of the magawhompus
Oracle/Sybase license-- would be the way to go.   But I don't think that is what
90% of the Mailman users are going to be using the system for.   Requiring-- or
even encouraging the use of-- a database as a backing store for messages will not
add value to those people.

Considering most of the usage of an archive of messages....

    - write operations are infrequent, modifications pretty much nonexistent

    - retrieval tends to be extremely sporadic and is generally *not* evenly
divided across the archive-- a relative few messages receive most of the hits

    - there are extremely limited ways of viewing the data;  by author, date,
thread, subject.... with MOST views focused on thread.

    - indices are periodically updated

... I still believe that a webserver-reading-files-from-filesystem is going to be
loads more efficient than a
webserver-reading-data-from-client-server-connection-to-app-adaptor-reading-data-from-client-server-connection-to-multiuser-database.
It is the same reason why publishing systems don't often put the images in the
database [instead, dropping URLs or other symbolic references into the database]--
modern operating systems [Linux, FreeBSD, OSX, Solaris] are very good about
caching data off of the disk in memory until that memory is needed for something
else.   A webserver sitting on such a system will typically map anything the
client asks for into memory and not reread it off the disk until something else
bumps it out of memory.   Even if the web server is cycling children, the
underlying filesystem cache isn't going to push mapped files out of memory unless
that memory is needed for something else.
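
A minimal sketch of the file-mapping idea, assuming a server process that
simply maps the requested file and hands back its bytes.   The function name is
hypothetical and this is not how any particular web server is implemented; it
only shows that reading through a memory map leaves caching to the operating
system, so repeated reads of a popular message need not touch the disk while
its pages stay resident.

```python
import mmap
import os


def read_message(path):
    """Return the contents of `path` via a read-only memory map."""
    # Zero-length files cannot be memory-mapped, so handle them up front.
    if os.path.getsize(path) == 0:
        return b""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS page cache backs the mapping.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]
```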

Certainly, solution providers like Oracle and Sybase rolled their own webservers
that connect relatively directly to the database.   I don't think very many
Mailman installations would use this and, if they do, the cost of adding a
generic "accept an http msg w/an XML body describing an email and I'll deal
w/archiving it in the fashion most appropriate to my implementation needs" is
minor compared to the other costs likely inherent to such a project.
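
To make the "XML body describing an email" idea concrete, here is one possible
shape for such a payload, sketched with Python's standard ElementTree module.
The element names (`message`, `from`, `subject`, `body`) and both function names
are invented for illustration; no such format is defined by Mailman, and the
HTTP transport itself is left out.

```python
import xml.etree.ElementTree as ET


def email_to_xml(msg_id, sender, subject, body):
    """Serialize the bare essentials of an email into an XML payload (bytes)."""
    root = ET.Element("message", {"id": msg_id})
    ET.SubElement(root, "from").text = sender
    ET.SubElement(root, "subject").text = subject
    ET.SubElement(root, "body").text = body
    return ET.tostring(root, encoding="utf-8")


def xml_to_email(payload):
    """Parse a payload produced by email_to_xml back into a dict.

    The receiving end would hand this dict to whatever archiver it prefers.
    """
    root = ET.fromstring(payload)
    return {
        "id": root.get("id"),
        "from": root.findtext("from"),
        "subject": root.findtext("subject"),
        "body": root.findtext("body"),
    }
```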

>
> >I feel *very strongly* that the archival solution-- whether it be raw
> >messages or decoded messages-- should be centric to storing files in
> >directories and serving files from directories.
>
> And I disagree, since you're tying to a single technology that works
> for some cases, but isn't a general solution, and limits other
> options. For instance, it's fairly easy to implement a lot of
> searching via DBI. it's a lot harder using a filesystem store.

Yes-- if you are doing textual searches *directly* against the filesystem,
filesystems are a lose.  But a database is not a whole lot better unless you blow
the relatively major [compared to most Mailman deployment budgets] $$$ on
something like Sybase's or Oracle's free text searching solutions.

Regardless of where the data is stored, searches would typically have to be backed
by some kind of an indexing store-- be it in a database, a btree type solution
[Faircom's C-Tree comes to mind], or any of the myriad of "search an index"
solutions available.   Building that index from a database or from a filesystem
isn't really that much different in terms of difficulty though a number of folks
would find the problem of walking a filesystem more approachable than walking a
database.
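
For a sense of how small the "walk the filesystem and build an index" problem
can be, here is a toy inverted index in Python.   It assumes one message per
plain-text file under a root directory; the function names and the tokenizing
rule are assumptions for the example, and a real deployment would persist the
index rather than hold it in memory.

```python
import os
import re
from collections import defaultdict


def build_index(root):
    """Map each lowercased word to the set of file paths that contain it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                # Crude tokenizer: runs of ASCII letters and digits.
                for word in re.findall(r"[a-z0-9]+", f.read().lower()):
                    index[word].add(path)
    return index


def search(index, *words):
    """Return the sorted paths that contain every one of `words`."""
    sets = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []
```

Searches then hit the index, never the message files themselves, which is the
point being made above regardless of where the messages live.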

> >This is *not* to say that the DBI approach isn't the right way to go;  if a
> >generic DBI->filesystem, DBI->WebDAV, DBI->DB capable API were put together
> >and was relatively hidden from the user and casual developer, it might be a
> >huge win.
>
> which is exactly what I'm arguing. And DBI is a well-known interface
> that is easily accessible to anyone who wants to write an interface
> -- if the architecture is done right. And it's portable off of Unix,
> which is an issue for Mailman.

I think we agree on the architectural direction of Mailman, but disagree on the
relative merits of different backing stores.   As such, the above discussion is
very interesting, but largely moot.

Internally, it makes a lot of sense to use DBI as the abstraction layer between
Mailman and the thing that persists BLOBs, messages, and meta information.   The
underlying implementation is mostly irrelevant.
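
One way to picture such an abstraction layer, sketched in Python since that is
what Mailman is written in.   This is not DBI and not an existing Mailman
interface: the class names, the two-method API, and the `.eml` file naming are
all hypothetical, and message ids are assumed to be safe to use as file names.
SQLite stands in for "a database" only because it ships with Python.

```python
import os
import sqlite3
from abc import ABC, abstractmethod


class MessageStore(ABC):
    """Hypothetical interface the archiver would code against."""

    @abstractmethod
    def put(self, list_name, msg_id, raw):
        """Persist the raw bytes of one message."""

    @abstractmethod
    def get(self, list_name, msg_id):
        """Return the raw bytes of one message."""


class FilesystemStore(MessageStore):
    """Adaptor that keeps one file per message under root/list_name/."""

    def __init__(self, root):
        self.root = root

    def _path(self, list_name, msg_id):
        return os.path.join(self.root, list_name, msg_id + ".eml")

    def put(self, list_name, msg_id, raw):
        path = self._path(list_name, msg_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(raw)

    def get(self, list_name, msg_id):
        with open(self._path(list_name, msg_id), "rb") as f:
            return f.read()


class SqliteStore(MessageStore):
    """Adaptor that keeps messages as BLOBs in a single table."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "list_name TEXT, msg_id TEXT, raw BLOB, "
            "PRIMARY KEY (list_name, msg_id))"
        )

    def put(self, list_name, msg_id, raw):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO messages VALUES (?, ?, ?)",
                (list_name, msg_id, raw),
            )

    def get(self, list_name, msg_id):
        row = self.conn.execute(
            "SELECT raw FROM messages WHERE list_name = ? AND msg_id = ?",
            (list_name, msg_id),
        ).fetchone()
        if row is None:
            raise KeyError(msg_id)
        return bytes(row[0])
```

Code written against `MessageStore` does not care which adaptor it is handed,
which is the sense in which the backing store becomes an implementation detail.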

With that said, I do feel strongly that an abstraction layer does little good
without concrete implementations of the adaptors underneath.   As such, I
volunteer myself to write the adaptor to WebDAV and I'll tentatively volunteer
Chuq to write the adaptor that speaks to a database backing store.   Our fierce
sense of technical pride and intellectual competition will guarantee that Mailman
version X.X will ship with first class implementations of both adaptors. :-)

>
> > > Truly. And if we can support BLOBs in DBI, well, we don't have to
> > > write anything to disk and can generate an entire message out of a
> > > DBI database -- portable to any decent database.
> >
> >But an order of magnitude less efficient than downloading the BLOB off of
> >disk via a webserver!
>
> Not necessarily. And is the efficiency important? A lot of time is
> wasted in computer development optimizing the wrong things....

Agreed.   Never optimize until you know where performance is going south...

> >I feel strongly that abstraction is key, but that we should also provide
> >decent, production quality, implementations of solutions to the very same
> >set of problems for which we build the generic abstracted/modularized APIs.
>
> agreed.
>
> >If Mailman is not fully functional "out of the box", then people will
> >ignore it.   However, if it isn't also flexible enough to be integrated into
> >their weird environments (because every server on the web has weirdness),
> >they'll bitch and moan until they find something else that doesn't solve
> >their problem to B&M about....
>
> it has to be functional out of the box, but we have to make sure we
> define "functional" properly, too. You can't be missing features but
> doing the "kitchen sink" thing just in case is wrong, too.

Agreed.

b.bum