Graham's spam filter

Oren Tirosh oren-py-l at hishome.net
Thu Aug 22 04:24:39 EDT 2002


On Wed, Aug 21, 2002 at 10:48:59PM -0700, Erik Max Francis wrote:
> > What this program momentarily tries to implement is a client/server
> > based protocol with authentication that allows some program to contact
> > the server for classifying text that is passed in, working around the
> > limitation that was discussed on the mailing-list that it is quite bad
> > for response time to always have to reload the database on scanning.
> 
> I don't think that this is necessarily true; certainly and without a doubt,
> reloading the _entire_ database each time is a non-starter.  The
> possibility of using a gdbm or similar database system might shorten
> those times to very reasonable amounts, but this is something I haven't
> researched yet.

Reloading the entire database is not necessarily a non-starter. If the
database is laid out as a hash table in a single linear memory block,
using offsets or indices instead of pointers, the file can be mmapped
as-is.  The page cache will take care of the rest: only the pages that a
lookup actually touches are read, and they stay cached between runs.  I
think this is easier to implement and manage than a client-server
solution. I wouldn't be surprised if it's faster, too.
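
A minimal sketch of the idea, not a tested design: a pointer-free hash table
stored in a flat file and mmapped read-only. Everything named here is an
assumption made for illustration (the file layout, the 24-byte key width,
CRC32 as the hash, linear probing, and the helpers build_table, open_table
and lookup). It also assumes tokens are non-empty and distinct within their
first 24 bytes.

```python
import mmap
import struct
import zlib

KEY_LEN = 24                          # assumed fixed key width, NUL padded
HEADER = struct.Struct("<I")          # slot count
SLOT = struct.Struct("<%dsd" % KEY_LEN)  # key bytes + probability (double)
EMPTY = b"\0" * KEY_LEN               # an all-NUL key marks an unused slot


def _slot_index(key, nslots):
    # CRC32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key) % nslots


def build_table(path, probs):
    """Write {word: probability} as an open-addressing table (linear probing)."""
    nslots = max(8, 2 * len(probs))   # keep the load factor at or below 0.5
    slots = [None] * nslots
    for word, p in probs.items():
        key = word.encode("utf-8")[:KEY_LEN]
        i = _slot_index(key, nslots)
        while slots[i] is not None:
            i = (i + 1) % nslots
        slots[i] = (key, p)
    with open(path, "wb") as f:
        f.write(HEADER.pack(nslots))
        for s in slots:
            f.write(SLOT.pack(*(s or (EMPTY, 0.0))))


def open_table(path):
    """Map the file read-only; pages are faulted in on first access."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(table, word, default=None):
    """Probe the mapped table directly; nothing is parsed or copied up front."""
    (nslots,) = HEADER.unpack_from(table, 0)
    key = word.encode("utf-8")[:KEY_LEN]
    padded = key.ljust(KEY_LEN, b"\0")
    i = _slot_index(key, nslots)
    for _ in range(nslots):
        stored, p = SLOT.unpack_from(table, HEADER.size + i * SLOT.size)
        if stored == EMPTY:
            return default            # hit an unused slot: key is absent
        if stored == padded:
            return p
        i = (i + 1) % nslots
    return default
```

Because slots are addressed by index rather than by pointer, the file has
the same meaning wherever it is mapped, so each scanner process can call
open_table() at startup and go straight to lookup().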

This assumes that updating the probabilities database is a batch operation
done periodically that creates a new database and then does a rename and
unlink.
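
The batch update could be sketched as follows, assuming a POSIX filesystem
where renaming over an existing path is atomic and implicitly unlinks the
old file. The helper name publish_database and the temp-file naming are
hypothetical. Readers that already have the old file mapped keep seeing the
old contents until they reopen the path.

```python
import os
import tempfile


def publish_database(final_path, data):
    """Write `data` to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".probs-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # make sure the bytes are on disk first
        # Replaces final_path in one step; the old inode goes away once the
        # last process that has it open or mapped lets go of it.
        os.replace(tmp_path, final_path)
    except BaseException:
        # On any failure, do not leave a half-written temp file behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A reader never observes a partially written database this way: it sees
either the complete old file or the complete new one.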

	Oren



