[Mailman-Developers] Re: Future of pipermail?

Chuq Von Rospach chuqui@plaidworks.com
Tue, 21 Nov 2000 23:54:04 -0800


At 2:23 AM -0500 11/22/00, Bill Bumgarner wrote:
>Yes-- all as true with a database system... but require the addition 
>of something
>other than just a vanilla HTTP server to implement the "get the data" part.

So does running mailman. So does running a search engine on the 
archives. We've already committed to adding a bunch of stuff to a 
vanilla HTTP server (heck, what started this whole discussion, 
WebDAV, is adding something to a vanilla HTTP server). This is a 
strawman -- we can't do anything close to what we need with a vanilla 
HTTP server, so putting down one thing because it doesn't meet that 
goal is wrong. NOTHING useful meets that goal.

>Everything on the list above save for the "with little addition" 
>item can be done
>w/an out of the box apache.

um, arguable.
>
>    Obviously, this does not include the administrative
>part-- the piece that does per list/per-site configuration requires some
>additional work regardless of a database or filesystem (or WebDAV) 
>backing store.

oh -- everything, well, except, um... (giggle)

>Another advantage to a filesystem based archival arrangement is that it is
>*extremely* easy to write a random shell script or two that prunes data from
>crontab, rebuilds indices, moves stuff around, reformats things, 
>archives off to
>archived archives, etc...

have you ever worked with NNTP? Because you're reinventing something 
the NNTP people have spent years designing out of their systems, 
because it has horrible scaling capabilities and is horribly resource 
inefficient. Some of us were arguing that this was a bad design model 
over a decade ago, and it's been proven by NNTP very nicely.

>  Yes-- of course-- all of these operations can be done
>with a database backing store, as well, but it is significantly more 
>difficult to
>develop such tools and install them into the system.

It is? have you done it? I have... (www.apple.com/signmeup). It's not 
as bad as you think.

>  Likely, this is somewhat of
>a perception issue, but most administrators will not hesitate to 
>toss together a
>shell script to manage an archive of stuff, but will think twice 
>before diddling a
>database.

yes, that's a perception issue, and it's also a strawman. I don't buy 
that for a second. Numbers, please. you can't throw a strawman like 
this out without backing data.


>Very large and very expensive databases scale very well-- MySQL does not.

sure it does. But even more important -- if I outgrow MySQL and need 
to throw in a really big muther database like Oracle, I can fairly 
easily. If your filesystem store system is outgrown and you need to 
add capacity to scale it, how do you do it? you rearchitect from 
scratch, probably going to a database-centric design.

>   I
>agree that for truly huge, high traffic sites,  moving away from a 
>pure filesystem
>approach-- moving everything into, say, an ES10000 running one of 
the megawhompus
>Oracle/Sybase license-- would be the way to go.

Oh, please. I'm running big megawhompus stuff on E250s and E450s 
(for my news servers), and MySQL no problemo. My big muther list 
server is MySQL on an E250, handles 28,000 database updates a day on 
average, sends out a few million emails a week, and spends most of 
its time idle (the only huge CPU sink I have is bounce processing, 
but then it takes time to process 300 megabyte bounce files).
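
To put that update rate in perspective, here is a quick back-of-the-envelope calculation. It uses only the 28,000 updates/day figure above. Spreading the updates evenly across the day is an assumption made for the arithmetic; real traffic is burstier.

```python
# Rough average rate implied by 28,000 database updates per day.
# Assumes updates are spread evenly over 24 hours, which real traffic is not.
UPDATES_PER_DAY = 28_000
SECONDS_PER_DAY = 24 * 60 * 60

updates_per_second = UPDATES_PER_DAY / SECONDS_PER_DAY
seconds_between_updates = SECONDS_PER_DAY / UPDATES_PER_DAY

print(f"{updates_per_second:.2f} updates/second on average")
print(f"one update every {seconds_between_updates:.1f} seconds")
```

That works out to roughly one update every three seconds on average, which is consistent with a machine that sits idle most of the time.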

>   But I don't think that is what
>90% of the Mailman users are going to be using the system for and 
>requiring-- or
>even encouraging the use of-- a database as a backing store for 
>messages will not
>add value to those people.

nor is that what I'm proposing. I don't know why you're so database 
averse, to be honest. but I think it's a personal aversion on your 
part, not a legitimate technical issue.

>Considering most of the usage of an archive of messages....
>
>     - write operations are infrequent, modifications pretty much nonexistent
>
>     - retrieval tends to be extremely sporadic and is generally *not* evenly
>divided across the archive-- a relative few messages receive most of the hits
>
>     - there are extremely limited ways of viewing the data;  by author, date,
>thread, subject.... with MOST views focused on thread.
>
>     - indices are periodically updated
>
>... I still believe that a webserver-reading-files-from-filesystem 
>is going to be
loads more efficient than a
>webserver-reading-data-from-client-server-connection-to-app-adaptor-reading-data-from-client-server-connection-to-multiuser-database.

Sorry, but my research doesn't agree. And I'm not sure I agree with 
your idea of what goes on in archives, but I don't have numbers to 
back myself up on that. It also ignores ancillary advantages of 
databasing this stuff -- like the easy addition of content searching 
and the ability to write really good, customized search capabilities. 
In your way, you have to build technology (or borrow it, like HtDig) 
to get that, so anything you might possibly save resource or 
development wise in your model gets eaten by trying to do searching 
right -- and one thing I *have* found from my users is that archives 
without good search tools are pretty useless to them. So I consider 
archive/search a single key integrated module, even if the 
technologies are separate, and databasing stuff allows me to build a 
lot of search power into the system, where your setup doesn't -- you 
have to go and do it the hard way (and I've done that, and it 
sucks...)
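
As a rough illustration of the kind of customized search a database store makes cheap, here is a minimal sketch. Everything in it is an assumption for illustration: the `messages` table, its columns, the sample rows, and the `search` helper are hypothetical, and SQLite stands in only because it ships with Python (it is not the MySQL setup described in this message). The `LIKE` match is a plain substring scan, not a real full-text index.

```python
import sqlite3

# Hypothetical schema: one row per archived message.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE messages (
           id INTEGER PRIMARY KEY,
           list_name TEXT, author TEXT, subject TEXT,
           sent_date TEXT,   -- ISO 8601, so string comparison orders by date
           body TEXT)"""
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO messages (list_name, author, subject, sent_date, body) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("dev-list", "alice", "Archive storage", "2000-11-20",
         "WebDAV as a backing store"),
        ("dev-list", "bob", "Re: Archive storage", "2000-11-21",
         "a database scales better"),
        ("users-list", "bob", "Bounce handling", "2000-10-02",
         "database updates from bounce files"),
    ],
)


def search(conn, list_name, text, since):
    """Return (sent_date, author, subject) for messages on one list whose
    subject or body contains `text`, sent on or after `since`."""
    pattern = f"%{text}%"
    return conn.execute(
        "SELECT sent_date, author, subject FROM messages "
        "WHERE list_name = ? AND sent_date >= ? "
        "AND (subject LIKE ? OR body LIKE ?) "
        "ORDER BY sent_date",
        (list_name, since, pattern, pattern),
    ).fetchall()


results = search(conn, "dev-list", "database", "2000-11-01")
print(results)
```

Restricting by list, date, and content is one query here. Doing the same against a directory of files means walking the tree and parsing each message, or bolting on an external indexer.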

>
>Yes-- if you are doing textual searches *directly* against the filesystem,
>filesystems are a lose.  But a database is not a whole lot better 
>unless you blow
>the relatively major

Not true.

>Regardless of where the data is stored, searches would typically 
>have to be backed
>by some kind of an indexing store-- be it in a database, a btree type solution

but one of the nice things about databases is they're written to 
build indexes for you -- and the guys who write those indexing 
routines are experts. so you leverage their strengths.
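
A small sketch of what "build indexes for you" looks like in practice. The table, column, and index names are hypothetical, the rows are synthetic, and SQLite is used only because it is bundled with Python. The exact wording of the query plan varies between SQLite versions, so the code only prints it rather than promising particular text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, thread_id INTEGER, "
    "author TEXT, sent_date TEXT, subject TEXT)"
)
# One statement per index; the database builds and maintains the structures.
conn.execute("CREATE INDEX idx_thread ON messages (thread_id, sent_date)")
conn.execute("CREATE INDEX idx_author ON messages (author)")

# Synthetic data: 1000 messages spread over 50 threads and 7 authors.
conn.executemany(
    "INSERT INTO messages (thread_id, author, sent_date, subject) "
    "VALUES (?, ?, ?, ?)",
    [
        (i % 50, f"author{i % 7}", f"2000-11-{(i % 28) + 1:02d}", f"subject {i}")
        for i in range(1000)
    ],
)

# Ask the query planner how it would fetch one thread in date order.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT subject FROM messages WHERE thread_id = ? ORDER BY sent_date",
    (7,),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)  # last column is the description
print(plan_text)

thread = conn.execute(
    "SELECT subject FROM messages WHERE thread_id = ? ORDER BY sent_date",
    (7,),
).fetchall()
print(len(thread), "messages in thread 7")
```

The thread view is served from the index without any index-maintenance code on the application side; the same inserts keep both indexes current.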

>solutions available.   Building that index from a database or from a 
>filesystem
>isn't really that much different in terms of difficulty though a 
>number of folks
>would find the problem of walking a filesystem more approachable 
>than walking a
>database.

I've done both. I disagree. filesystem-centric systems are system 
intensive resource hogs that are fine for small to medium 
installations but scale poorly, and which require re-architecting 
when you outgrow them. Database-centric ones might be a little more 
work up front, might be somewhat overkill for tiny sites, but even 
for small ones, tend to break even, and scale upward basically 
infinitely because you can swap in bigger horses based on your budget 
-- but you don't need those horses. You can do really good stuff with 
open source tools. All you need is some good database design. not 
even great database design.

>With that said, I do feel strongly that an abstraction layer does little good
>without concrete implementations of the adaptors underneath.   As such, I
>volunteer myself to write the adaptor to WebDAV and I'll tentatively volunteer
>Chuq to write the adaptor that speaks to a database backing store.

Actually, chuq's planning on architecting large parts of mailman 3.0, 
assuming Barry gives the Okay. but I volunteered for that weeks ago, 
and have been whacking on concepts on and off since.  And since a lot 
of what I hope to do in mailman will be leveraged off work I've 
already done or am planning to do for Other Things I Can't Admit To, 
a lot of it is beyond "I think this will work, maybe" in thought...

Since 1994, I've architected and implemented about half a dozen 
production e-mail systems, from really tiny things based on common 
tools (first was Listproc, then majordomo, now Mailman, for the three 
generations of my 'generic' servers), to really bastard big things to 
corporate email systems... And in the next six months, I'm rewriting 
my big muther almost from scratch to take it to the 25,000,000 
subscriber capability and double delivery speed (again), and then 
we're probably redoing the internal corporate (which supports ~15,000 
lists) to handle list lookup and delivery/authentication on demand 
via LDAP to the corporate databases (right now, we 
snapshot/download/munge the data...). So a lot of where I think 
Mailman ought to go is taking pieces of some of these boxes (with my 
boss's kind permission) and making them part of Mailman...

And trust me, it'll be a long time before even my biggest email 
system needs Oracle or an ES10000. you can do wonders with a stack of 
Ultra 5s slaved to a decent sized box like an E250 (in fact, in many 
cases, that's a lot better). I apologize if this sounds like I'm 
pulling rank, but -- you keep saying things can't be done, and I 
know that's wrong, because I've already done them in some form or 
another, or I've already done the design (and/or prototype) and know 
how it'll work.

chuq

-- 
Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com)
Apple Mail List Gnome (mailto:chuq@apple.com)

The vet said it was behavioral, but I prefer to think of it as genetic.
It cuts down on the liability -- Get Fuzzy