Graham's spam filter

Heiko Wundram hewu5001 at stud.uni-saarland.de
Thu Aug 22 10:50:01 EDT 2002


On Thu, 2002-08-22 at 10:24, Oren Tirosh wrote:
> On Wed, Aug 21, 2002 at 10:48:59PM -0700, Erik Max Francis wrote:
> > I don't think that this is necessarily true; certainly and without a doubt,
> > reloading the _entire_ database each time is a non-starter.  The
> > possibility of using a gdbm or similar database system might shorten
> > those times to very reasonable amounts, but this is something I haven't
> > researched yet.
> Reloading the entire database is not necessarily a non-starter. If the
> database is represented as some kind of hash table in a linear memory block
> without using any pointers it can be mmapped.  The page cache will take
> care of the rest.  I think this is easier to implement and manage than a 
> client-server solution. I won't be surprised if it's faster, too.

Might be faster, using an mmapped gdbm database or the like. I tried
using gdbm directly, but under high load the program responds too slowly
for my taste and also burns far too much CPU time (we don't have
access to better hardware than a K6-2/400 as our main server doing
everything... :))
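
To make the mmap idea concrete, here is a rough sketch of the kind of
pointer-free table Oren describes. It is not gdbm and not the code from
my server; the record layout (16-byte token plus a float64), the CRC32
hash and the names build_table, open_table and lookup are all
assumptions made up for illustration.

```python
import mmap
import struct
import zlib

# Hypothetical fixed-size record: 16-byte NUL-padded token + float64 probability.
SLOT = struct.Struct("<16sd")
KEYLEN = 16
EMPTY = b"\x00" * KEYLEN


def _key(token):
    # Truncate/pad the token to the fixed key width (an assumption of this sketch).
    return token.encode("utf-8")[:KEYLEN].ljust(KEYLEN, b"\x00")


def build_table(path, probs, nslots):
    """Write an open-addressing hash table with no pointers, only offsets."""
    if nslots <= len(probs):
        raise ValueError("table needs at least one free slot")
    buf = bytearray(SLOT.size * nslots)
    for token, p in probs.items():
        key = _key(token)
        i = zlib.crc32(key) % nslots
        # Linear probing: step forward until an empty slot is found.
        while bytes(buf[i * SLOT.size:i * SLOT.size + KEYLEN]) != EMPTY:
            i = (i + 1) % nslots
        SLOT.pack_into(buf, i * SLOT.size, key, p)
    with open(path, "wb") as f:
        f.write(buf)


def open_table(path):
    """Map the file read-only; pages are loaded on demand by the OS."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(mm, token, default=None):
    """Probe the mapped table for token; return its probability or default."""
    nslots = len(mm) // SLOT.size
    key = _key(token)
    i = zlib.crc32(key) % nslots
    for _ in range(nslots):
        stored, p = SLOT.unpack_from(mm, i * SLOT.size)
        if stored == key:
            return p
        if stored == EMPTY:
            return default
        i = (i + 1) % nslots
    return default
```

Each process would call open_table() once at startup and lookup() per
token; nothing is parsed or copied up front, which is the point of the
approach. Whether it beats a server on a given machine would still
have to be measured.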

Another thing that made me consider a client/server based solution is
the fact that you can then build a central probabilities database; this
(I think) solves many concerns that people have raised about training
the algorithm.

Of course a central database can only be useful in a closed unit; it
would be pointless to share my data (which mainly consists of German
spam) with someone who lives in the U.S., as German spam should be quite
unlikely for them.

> This assumes that updating the probabilities database is a batch operation 
> done periodically that creates a new database and then does a rename and
> unlink. 

That's how it's done in the current model server.
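
For anyone following along, the batch update boils down to something
like the sketch below. This is an illustration, not the server's actual
code; the function name publish and the ".probs-" prefix are invented.
It relies on os.replace(), which renames over the destination in one
step, so the old file is unlinked implicitly.

```python
import os
import tempfile


def publish(path, data):
    """Write data next to path, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".probs-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk first
        os.replace(tmp, path)     # readers see either the old or the new file
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

On POSIX systems, a process that still has the old file open or mmapped
keeps seeing the old contents until it reopens the path, so readers
never observe a half-written database.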

Well, I'll see what comes out of my efforts. Maybe it'll actually prove
to be useful.

Yours,

	Heiko Wundram
	Netzwart Wohnheim-D
	Universität 18 - Zimmer 2206 - Saarbrücken




