[spambayes-dev] Re: Pickle vs DB inconsistencies

Wed Jul 9 11:16:04 EDT 2003

I've been putting off responding to this thread for a while,
but it's now time for me to chip in, too.

In message:  <20030704135718.GB1127 at cthulhu.gerg.ca>
             Greg Ward <gward at python.net> writes:
>
>My biggest gripe with spambayes is the inconsistency of the command-line
>tools.  They're scattered around the CVS tree randomly, there are as
>many different ways to specify the training database as there are
>separate scripts, and they all try to do too much.

Aye, the command-line tools are scattered around wherever the
original authors thought to put them, and were inconsistently
moved when things were put into subpackages, etc.  Too much
history, not enough planning.

I'm not sure that each of the scripts tries to do too much, though.

>IMHO there should be one script for each of the following tasks:
>
>  * training a bunch of messages

There has to be more than one script for this, given that people
present their training data in different ways.  Most notable is
the difference between having "ham" and "spam" corpora vs. having
"everything" and "spam" corpora.  Trying to combine both of these
approaches into a single script makes that one script bizarrely
complex... larger than the sum of two separate scripts.
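
To make the difference concrete, here is a minimal sketch of the two
layouts.  All of it is illustrative: the function names, the train()
callback and the key() function are made up for this example and are
not existing spambayes APIs, and it assumes the "everything" corpus
contains the spam, so that ham is whatever is left over.

```python
def train_ham_and_spam(train, ham_msgs, spam_msgs):
    """Layout 1: separate ham and spam corpora; labels are implicit."""
    for msg in ham_msgs:
        train(msg, False)
    for msg in spam_msgs:
        train(msg, True)


def train_everything_and_spam(train, all_msgs, spam_msgs, key=lambda m: m):
    """Layout 2: an "everything" corpus plus a spam corpus.

    Assumption: a message is ham if it is in all_msgs but not in
    spam_msgs.  key() is a hypothetical function returning something
    hashable that identifies a message (e.g. its Message-ID).
    """
    spam_keys = set(key(m) for m in spam_msgs)
    for msg in all_msgs:
        train(msg, key(msg) in spam_keys)
```

Even in this toy form the second layout needs a way to identify
messages that the first never has to think about.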

>  * filtering a single message, ie. read it, score it, write it
>    back with "X-..." header(s) added

It would be good to have a single interface for this, but I think
that it would be better to have that as a Python module interface
instead of as a full-fledged script.  After all, we don't really
want to have pop3proxy spawning processes just to filter messages.
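
A rough sketch of what such a module interface might look like,
using only the standard email package.  The function name, the header
name and the score_func callback are hypothetical, made up for this
example; they are not the names spambayes actually uses.

```python
import email

# Hypothetical header name, chosen only for this sketch.
HEADER_NAME = "X-Spam-Score-Sketch"


def filter_message(raw_text, score_func, header_name=HEADER_NAME):
    """Read a message, score it, and return it with a score header added.

    score_func is a stand-in for whatever classifier is in use: it takes
    an email.message.Message and returns a float.
    """
    msg = email.message_from_string(raw_text)
    score = score_func(msg)
    # Drop any stale copy of the header before adding the new one;
    # deleting a header that is not present is a no-op.
    del msg[header_name]
    msg[header_name] = "%.2f" % score
    return msg.as_string()
```

A caller such as a proxy could then call filter_message() in-process
for each message, and a thin command-line wrapper could do the same
for stdin/stdout use.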

>  * scoring one or more messages, ie. read each one, score it, and
>    print a single line with the results

I'm not sure if all the results that might be interesting for testing
would fit on a single line... timcv.py does a bunch of stuff behind
the scenes to generate the summaries and histograms, for instance,
which may be difficult to re-extract from a single line for each message
(or perhaps just expensive to do so).

>  * export a database
>  * import a database

I personally don't see the problem with combining these as in
dbExpImp.py.

>They should all live in a 'scripts' directory (or something), and
>(naturally) they should use Optik/optparse for a consistent command-line
>interface.

There has historically been a divide between 'production use' and
'testing use' for scripts.  The argument for this divide is that
most people will be completely uninterested in testing, and by
segregating those scripts whose only purpose is testing we can
cater to those folks by not even showing them the testing stuff.
While this is laudable from the perspective of not confusing people
on first approach to the system, it does tend to make it harder
for them to graduate to actually quantifying the effects on their
mail and contributing to the (now effectively dead) testing cadre.

Do we want to re-merge everything into a single 'scripts' directory,
or do we want to retain the distinction between testing and production?

- Alex