[Mailman-Developers] Idea

John Viega viega@list.org
Sat, 13 Jun 1998 15:50:53 -0700


On Sat, Jun 13, 1998 at 05:55:41PM -0400, Corbett J. Klempay wrote:
> 
> - how big of an archive should this be scalable to?

I'd say as big as possible.  I don't know if I can give a better
answer than that.  Target the most heavily trafficked mailing list
you've been on.

> I'm thinking this because some models (like the vector model used for my
> project) get good accuracy, but suck as far as resource usage (like our
> engine dealt with ~2000 text documents from an online database and would
> suck up ~80 MB of RAM per query, and would take almost 20 seconds just to
> load the pre-indexed document clusters from disk; the query itself only
> takes like 1-2 seconds on a K6-233, but the startup time of 15+ seconds
> blows).  With large corpora, it might be necessary to implement some
> persistence; it just takes so annoyingly long to load even pre-indexed
> stuff from disk (and our queries were on a K6-233 with 128 MB;
> heheh...think how a P-90 with 32 MB would fare :) (or worse yet!)

Well, you have several options.  You could keep a persistent server
up, but I'd make that an option rather than a requirement.  I think
that if complex search capabilities aren't desired, the grep libraries
would be an OK first pass.

> - What kind of structure does Pipermail store the archive in?  

This is the biggest problem right now.  Andrew says you can plug in
any sort of back-end you want for a database, as long as it can handle
a tree-type structure.  Unfortunately, the only such backend
implemented is not portable.  Everything needs to work out of the box
w/ a vanilla Python installation.  I was thinking that someone could
write a backend that uses the file system for that tree structure...

> - Did you have any idea about what kind of search interface?  Were you
> thinking a text field with Boolean capability, or just letting them throw
> some words in the field and see what sticks?

Well, for a first pass, something simple will do, but the nicer you
can make it, the better off we'll be...

> - Were you thinking of an engine that has an indexing process that runs
> via a cron job or something, or something much simpler? (like one that
> just brute force searches through the text of the archives for each query;
> that would be slow as hell if the archive was large, but would take no
> additional disk space and wouldn't really require persistence). 

It'd be nice to have a per-list setting for this one.  For most lists,
something simple would do, but I run a few where indexing would
certainly be better...
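For the lists where indexing wins, the cron-job variant could be sketched like this: build an inverted index (word to message ids) once per run, pickle it to disk, and answer queries from the index instead of re-reading every archive file.  The tokenizer and index file name here are assumptions for illustration:

```python
# Sketch of the cron-job alternative: pre-build an inverted index so
# queries don't have to touch the archive files at all.
import pickle
import re

def build_index(messages):
    """messages: dict of message id -> text.  Returns word -> set of ids."""
    index = {}
    for msg_id, text in messages.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, set()).add(msg_id)
    return index

def save_index(index, path="archive.idx"):
    # Run from cron after each archive update; the file name is arbitrary.
    with open(path, "wb") as f:
        pickle.dump(index, f)

def query(index, words):
    """Return ids of messages containing every word in the query."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()
```

The trade-off is exactly the one described above: extra disk space and a periodic indexing pass in exchange for fast queries, which is why a per-list setting makes sense.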

John