[Spambayes] Supporting new database type in classifier

Sat Feb 14 23:25:28 EST 2004

[Brad Clements]
> I'm working on a new type of storage that requires closer
> integration with classifier _getclues and _add_msg, _remove_msg.

You'll probably get better responses on the spambayes-dev list.

> For example, this code fragment in classifier._getclues:
>
>             # The all-unigram scheme just scores the tokens as-is.  A Set()
>             # is used to weed out duplicates at high speed.
>             clues = []
>             push = clues.append
>             for word in Set(wordstream):
>                 tup = self._worddistanceget(word)
>                 if tup[0] >= mindist:
>                     push(tup)
>             clues.sort()
>
> Would essentially be pushed into the database module. For
> efficiency, the database module must have the entire wordstream
> to work with.

I encourage you to work on a branch for now -- since most people drop most
ideas after a few weeks at most, I'm opposed to warping this part of the
code to cater to something as unlikely to be seen again as a
non-random-access database model.  If you work on a branch and demonstrate
astonishing results, great, then we'll junk all other storages and adopt
yours <wink>.
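
For anyone trying this on a branch, the batch lookup being proposed might
look roughly like the sketch below. BulkStore, lookup_many, get_clues and
the score callback are hypothetical names invented for illustration; none
of them exist in classifier. The idea is only that the storage sees every
distinct token in one call instead of one call per word.

```python
class BulkStore:
    """Hypothetical storage that answers one query for many tokens."""

    def __init__(self, counts):
        # token -> (spamcount, hamcount); stands in for a real backend.
        self._counts = counts

    def lookup_many(self, tokens):
        # A real backend would make a single round trip here.
        return {t: self._counts[t] for t in tokens if t in self._counts}


def get_clues(store, wordstream, score, mindist=0.1):
    """Score every distinct token using one storage call."""
    tokens = set(wordstream)            # weed out duplicates
    found = store.lookup_many(tokens)   # the whole lump at once
    clues = []
    for token in tokens:
        spamcount, hamcount = found.get(token, (0, 0))
        prob = score(spamcount, hamcount)
        distance = abs(prob - 0.5)
        if distance >= mindist:
            clues.append((distance, prob, token))
    clues.sort()
    return clues
```

The shape of the returned tuples (distance first, so the mindist test and
the sort behave like the fragment quoted above) is an assumption, not a
copy of what _worddistanceget returns.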

> _worddistanceget could be passed into the database as a callback,
> or the code could be replicated at the database level. That is,
> _worddistanceget calls _wordinfoget AND performs calculations. I'd
> prefer a function that accepts the token info (nham, nspam)
> and does the calculations w/o being coupled to _wordinfoget.
>
> Overriding _wordinfoget in a subclass doesn't work for me, because
> that function only gets called with one word at a time.
>
> I could override _getclues, but then I'd have to recreate the
> bigram stuff which is quite a lot.

It's less than 30 lines of code (half of it is comments).
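
The other request above, a function that takes the token counts and does
the arithmetic without touching _wordinfoget, could be sketched as a plain
function like the one below. The name spam_probability, its signature,
and the defaults for s and x are assumptions made for illustration, and
the adjusted-ratio formula is one common way to do this calculation, not
a transcription of the classifier's code.

```python
def spam_probability(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Estimate P(spam | token) from counts alone, with no storage access.

    spamcount/hamcount: messages of each kind containing the token.
    nspam/nham: total messages of each kind that were trained on.
    s, x: strength of the prior and the probability given to an unseen
    token (illustrative defaults, assumed here).
    """
    # Use ratios so that unequal corpus sizes do not skew the estimate.
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0
    if spamratio + hamratio == 0:
        return x  # nothing known about this token
    prob = spamratio / (spamratio + hamratio)
    # Pull the raw estimate toward x when there is little evidence.
    n = spamcount + hamcount
    return (s * x + n * prob) / (s + n)
```

A storage module could then call this with whatever counts it fetched in
bulk, and compute the distance as abs(prob - 0.5).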

> So, my first question is, could the bigram stuff be structured as a
> 'filter' before _getclues (modifying the wordstream) and before
> _add and _remove_msg?

The bigram stuff is already a filter before _add and _remove.  It could also
be done as a filter before _getclues, but not pleasantly.
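
As a rough picture of what "a filter" means here, a wordstream filter
that adds bigrams can be written as a generator. This is a minimal
sketch: the name with_bigrams and the "bi:" prefix are assumptions chosen
for illustration, not taken from this thread.

```python
def with_bigrams(wordstream):
    """Yield each token, plus a bigram joining it to the previous token."""
    last = None
    for token in wordstream:
        yield token
        if last is not None:
            yield "bi:%s %s" % (last, token)
        last = token
```

For the stream ["a", "b", "c"] this yields a, b, "bi:a b", c, "bi:b c".
Training can take such a stream as-is. Scoring is harder to express this
way, possibly because each unigram overlaps the bigrams built from it.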

> Second, what's the best way to restructure classifier so that a
> storage subclass can deal with entire wordstreams in one lump if
> it so chooses?

On a branch -- prove this is worth doing first, and don't worry about doing
it cleanly before that succeeds.