[Mailman-Developers] Idea

Corbett J. Klempay <cklempay@chimera.acm.jhu.edu>
Sat, 13 Jun 1998 17:55:41 -0400 (EDT)


Hey, I just saw the comment about integrating archive search
functionality in 1.0 or another not-so-distant release.  I was thinking
that maybe I could use this as a chance to learn Python; I just finished
taking Info Retrieval & Text Understanding this past semester, and ended
up doing a vector search engine for the final project.  I'm not promising
that I can do this; I might see if I can find some time to play around
with it, and if it develops OK, then bonus.  Some questions, though:

- How big an archive should this scale to?

I ask because some models (like the vector model used for my project) get
good accuracy but are heavy on resources.  Our engine dealt with ~2000
text documents from an online database and would suck up ~80 MB of RAM
per query.  It would take almost 20 seconds just to load the pre-indexed
document clusters from disk; the query itself only takes 1-2 seconds on a
K6-233, but the startup time of 15+ seconds blows.  With large corpora, it
might be necessary to implement some persistence (keeping the index
loaded between queries), since it takes so annoyingly long to load even
pre-indexed stuff from disk.  And our queries were on a K6-233 with
128 MB; heheh...think how a P-90 with 32 MB (or worse yet!) would fare :)

(for the really curious people, the engine is at
http://www2.acm.jhu.edu/projects/calliope)
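
To make the vector model concrete, here is a rough sketch of the general
idea.  It is not the Calliope code: the TF-IDF weighting, the smoothed IDF
formula, and the names tokenize, build_index, and search are all
assumptions made for illustration.  Each document becomes a weighted term
vector, and a query is scored by cosine similarity against every vector.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(vec):
    """Scale a sparse term->weight dict to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {term: w / norm for term, w in vec.items()}


def build_index(docs):
    """Return (idf, vectors) for a list of document strings.

    The IDF smoothing used here is one common choice, assumed for
    this sketch.
    """
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = [normalize({t: c * idf[t] for t, c in counts.items()})
               for counts in counts_per_doc]
    return idf, vectors


def search(query, idf, vectors, top_k=5):
    """Return up to top_k (doc_id, score) pairs, best match first."""
    q_counts = Counter(tokenize(query))
    q_vec = normalize({t: c * idf[t] for t, c in q_counts.items()
                       if t in idf})
    scored = []
    for doc_id, vec in enumerate(vectors):
        # Both vectors are unit length, so the dot product is the cosine.
        score = sum(w * vec.get(t, 0.0) for t, w in q_vec.items())
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]
```

Note that every query walks every document vector, so the whole index has
to be in memory first.  That is where the load time and RAM go.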

- What kind of structure does Pipermail store the archive in?  Maybe I
should contact someone who runs a larger Pipermail archive (maybe whoever
runs the lists at python.org?) and see about getting a copy of a tree
full of archived articles.

- Did you have a particular kind of search interface in mind?  Were you
thinking a text field with Boolean capability, or just letting them throw
some words in the field and see what sticks?
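
The two interface styles could look roughly like this.  It is only a
sketch under my own assumptions: the names build_postings, boolean_search,
and any_words_search are hypothetical, and the Boolean evaluator is
deliberately naive (strictly left to right, no parentheses, adjacent
words treated as AND).

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_postings(docs):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return postings


def boolean_search(query, postings, n_docs):
    """Evaluate a query such as 'mailman AND python' left to right.

    Operators must be upper case.  NOT applies to the next word only.
    """
    all_docs = set(range(n_docs))
    result = None
    op = "AND"
    negate = False
    for word in query.split():
        if word in ("AND", "OR"):
            op = word
            continue
        if word == "NOT":
            negate = True
            continue
        matches = set(postings.get(word.lower(), ()))
        if negate:
            matches = all_docs - matches
            negate = False
        if result is None:
            result = matches
        elif op == "AND":
            result &= matches
        else:
            result |= matches
        op = "AND"  # adjacent words default to AND
    return sorted(result) if result is not None else []


def any_words_search(query, postings):
    """Rank documents by how many distinct query words they contain."""
    hits = Counter()
    for term in set(tokenize(query)):
        for doc_id in postings.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))
```

The Boolean version returns an unranked set of matches; the "throw some
words in" version returns anything that matches at all, best first.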

- Were you thinking of an engine that has an indexing process that runs
via a cron job or something, or something much simpler? (like one that
just brute force searches through the text of the archives for each query;
that would be slow as hell if the archive were large, but would take no
additional disk space and wouldn't really require persistence).
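
A rough side-by-side of the two options, as a sketch only.  The function
names, the in-memory list of article texts, and the use of pickle for the
on-disk index are all assumptions for illustration; a real version would
read the articles from wherever Pipermail keeps them.

```python
import pickle
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def brute_force_search(docs, word):
    """Scan every document on every query; needs no index at all."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    return [doc_id for doc_id, text in enumerate(docs)
            if pattern.search(text)]


def build_index(docs):
    """Build a term -> sorted list of doc ids map (the cron job's work)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(tokenize(text)):
            index[term].append(doc_id)
    return dict(index)


def save_index(index, path):
    """Write the index to disk so queries do not have to rebuild it."""
    with open(path, "wb") as f:
        pickle.dump(index, f)


def load_index(path):
    """Read back an index written by save_index."""
    with open(path, "rb") as f:
        return pickle.load(f)


def indexed_search(index, word):
    """Look up one word; cost is independent of archive size."""
    return index.get(word.lower(), [])
```

The brute force version costs time on every query; the indexed version
moves that cost into the periodic build and pays for it in disk space,
plus the load time unless the index stays in memory.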

-------------------------------------------------------------------------------
Corbett J. Klempay			         Quote of the Week:
http://www2.acm.jhu.edu/~cklempay  "Economists are still trying to figure
				    out why the girls with the least
				    principle draw the most interest."

PGP Fingerprint: 7DA2 DB6E 7F5E 8973 A8E7  347B 2429 7728 76C2 BEA1
-------------------------------------------------------------------------------