[Spambayes] Using Spambayes with MySQL

Meyer, Tony T.A.Meyer at massey.ac.nz
Sat Aug 23 15:05:45 EDT 2003


> Does SpamBayes have any way (yet) of using MySQL databases?
> If so, what do I need to do to set this up?
[The following is an answer I sent in response to an off-list mail
essentially asking the same thing, in case others are interested.]

Hi David,

> (Apologies at the moment for mailing you directly, but with
> the current spamstorm, I'm sure you'll understand).

Indeed.

> I'm wanting to use SpamBayes with a MySQL back end, and add
> some ability to have a per-user training base.
> 
> On RTFS'ing, I see a base class storage.mySQLClassifier.

BTW I'm not sure if the source mentions it or not, but the SQL stuff is
very recent.

> But, hammiefilter.py uses hammie.open, which is hardwired to
> only choose between db file and pickle storage objects.
[...]
> So my questions to you are:
> 1) Do you advise the subclassing approach?

No, but only because the patching approach is better.

> Are the underlying
> classes stable enough for me to do this, or are my child 
> classes prone to breaking as Spambayes and its native classes change?

The underlying classes should be stable enough, although the whole
module may be renamed at some point.  The classifier, storage, and
tokenizer classes in particular should be *very* stable.

> 2) Or do you suggest I hack the source, and send patches
> against latest CVS?

Yes, I think this is the best path.  If you waited long enough this
would probably happen anyway, but given that (AFAIK) none of the main
developers has much interest in SQL-based classifiers, it's unlikely to
be soon.  If you open a tracker item on SourceForge and attach the
patches there, either Skip or I will integrate them.

If it works with what you have in mind, a tentative specification has
been discussed, as follows:
 * if [storage]use_persistent_database is false, use a pickle.
 * else if [storage]persistent_storage_file doesn't contain "::", use a dbm
   (this might change to follow the below at some point).
 * else split [storage]persistent_storage_file into two on the first "::":
    * the bit before the "::" is the storage type ("mysql", "pgsql", etc.)
    * the bit after the "::" is passed to the appropriate classifier
      (containing the user name, database name & location, and so on).

The code I used in pop3proxy looked like this:
"""
        if self.useDB:
            if '::' in filename:
                sql_types = {"pgsql" : storage.PGClassifier,
                             "mysql" : storage.mySQLClassifier,
                             }
                sql_type, rest = filename.split('::', 1)
                if sql_types.has_key(sql_type.lower()):
                    self.bayes = sql_types[sql_type.lower()](rest)
                else:
                    # yikes! raise some sort of NoSuchClassifierError
                    pass
            else:
                self.bayes = storage.DBDictClassifier(filename)
        else:
            self.bayes = storage.PickledClassifier(filename)
"""

(The mySQLClassifier __init__ could use improvement: it currently
expects the elements to be separated by spaces, which means that none of
the database info can itself contain a space.  The API here is
completely unstable; feel free to come up with whatever you think is
best).
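
One possible way around the space limitation, purely as a sketch: split
the string with shell-style quoting rules so that a quoted value may
contain spaces.  The "key=value" layout and the name parse_source_info
are assumptions made up for this illustration; they are not the format
mySQLClassifier currently accepts.

```python
import shlex


def parse_source_info(data_source_name):
    """Parse whitespace-separated key=value items into a dict.

    Hypothetical helper: shlex.split honours shell-style quotes, so a
    value such as user='spam filter' keeps its embedded space.
    """
    info = {}
    for item in shlex.split(data_source_name):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        info[key] = value
    return info


# Example with an invented connection string.
settings = parse_source_info("host=localhost user='spam filter' dbname=spambayes")
print(settings["user"])
```

The trade-off is that values containing quote characters would then
need escaping, so a different separator may turn out to be simpler.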

If you really wanted to do things nicely, then (IMO) it would be great
to have a function (probably in storage.py) that would do this for you,
so that hammie, pop3proxy et al could all use the same one.  Something
like "open_storage(use_pickle, data_source_name)".
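
A minimal sketch of what such a function might look like, following the
tentative specification above.  The classifier classes here are
stand-ins so that the example runs on its own (the real ones live in
storage.py), and NoSuchClassifierError and SQL_TYPES are hypothetical
names used only for illustration.

```python
class _StubClassifier:
    """Stand-in for the real storage classes; just records its argument."""

    def __init__(self, source):
        self.source = source


class PickledClassifier(_StubClassifier):
    pass


class DBDictClassifier(_StubClassifier):
    pass


class mySQLClassifier(_StubClassifier):
    pass


class PGClassifier(_StubClassifier):
    pass


class NoSuchClassifierError(ValueError):
    """Hypothetical error for an unrecognised storage type."""


# Storage type prefix (the bit before "::") -> classifier class.
SQL_TYPES = {"mysql": mySQLClassifier, "pgsql": PGClassifier}


def open_storage(use_pickle, data_source_name):
    """Return a classifier chosen from the pickle flag and source name."""
    if use_pickle:
        return PickledClassifier(data_source_name)
    if "::" not in data_source_name:
        # No storage type prefix, so treat the name as a dbm file.
        return DBDictClassifier(data_source_name)
    sql_type, rest = data_source_name.split("::", 1)
    try:
        classifier_class = SQL_TYPES[sql_type.lower()]
    except KeyError:
        raise NoSuchClassifierError(sql_type) from None
    return classifier_class(rest)
```

With something like this in place, hammie and pop3proxy could each call
open_storage(not use_persistent_database, persistent_storage_file)
instead of carrying their own copy of the selection logic.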

> Be assured with the source-hacking scenario, I respect that
> my 'Spamish Inquisition' package is controversial, and will 
> keep it completely separate from any code I contribute.

Thanks :)  For the moment at least, I don't think it would go down well
even just having it in the contrib directory.  OTOH, it's easily
accessible from the website.

> With
> any patches I contribute, I will take utmost care to extend 
> intelligently and not break any existing functionality.

Thanks - that makes going over the patches and checking them in that
much easier...

Cheers,
Tony