[Spambayes] Re: Guidance re pickles versus DB for Outlook
T. Alexander Popiel
popiel@wolfskeep.com
Tue Nov 26 21:42:29 2002
In message: <w53el9882i2.fsf@woozle.org>
Neale Pickett <neale@woozle.org> writes:
>So then, jeremy@alum.mit.edu (Jeremy Hylton) is all like:
>
>> I'm a big fan of unittests. We should probably develop some.
>
>I couldn't agree more.
Err, before we start writing unit tests, shouldn't we have some
specifications on what everything is actually supposed to do?
While we're at it, it would probably be good to gather requirements
and lay out interfaces for the various elements...
Here's a first cut at what I see as requirements for the code:
R1. The code should be written in a uniform language, as much as
is possible. For historical reasons, this language is probably
going to be Python version 2.2.1.
R2. The code (in training mode) should accept email messages and
their classifications as inputs, and record relevant data
for later classification.
R3. The code (in classification mode) should accept email messages
as input and offer trinary (ham/spam/unsure) classification as
output.
R4. Raw RFC 822 messages should be acceptable as input email messages.
R5. For the MS users, the botch that Outlook turns messages into should
be acceptable as input email messages. Despite my opinion of it. ;-)
R6. Classifications should be output as a line of the form:
X-Spambayes-Classification: ham/spam/unsure
where exactly one of ham, spam, or unsure is present (no slashes).
R7. Classifications may be added to the headers of a raw RFC 822
message, in which case the whole (annotated) message should be
echoed as output.
R8. In no case should more than one classification be present in
a message.
R9. If a message provided as training input has already been trained
with a classification, then it should be untrained from the old
classification before training with the new classification.
R10. There should be a classifier front-end usable as a procmail filter.
R11. There should be a classifier front-end usable as a pop3 proxy.
R12. There should be a classifier front-end usable as an Outlook plugin.
R13. There should be training front-ends appropriate to each of the
classifier front-ends.
R14. The classifier front-ends should use a common internal classifier
module/class/whatnot to do all work not specifically related to
managing input and output.
R15. The training front-ends should also use a common internal training
module/class/whatnot to do all work not specifically related to
managing input and output.
R16. There should not be any costly 'process all the data' operation
associated with either training or classification.
R17. The internal database for the knowledge gleaned from training
should be stored in persistent form between invocations.
R18. Changes to the internal database should be reflected in the
persistent store in a timely manner.
R19. Changes to the persistent representation of the database should
be done with an eye towards recoverability of the data in the
case that a power outage (or similar catastrophic event)
interrupts the update.
R20. The classification method should consist of some combination
scheme applied to ham/spam probabilities associated with tokens
derived from parsing the email messages.
R21. The chi-square combination method should be an allowed combination
scheme.
R22. The gary-combining method may be an allowed combination scheme.
R23. The modified Graham probability computation (without biases,
with Bayesian adjustment, etc.) should be an allowed probability
computation scheme.
R24. The tokenization scheme should be based on the recent spambayes
tokenizer code, which I don't feel like describing in sufficient
detail at this time.
etc...
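
To make R6 through R8 concrete, here is a minimal sketch of the header
annotation step. It is only an illustration, not existing spambayes
code: the function name annotate is hypothetical, and it is written
for a current Python 3 standard library rather than the 2.2.1 named
in R1.

```python
from email import message_from_string

HEADER = "X-Spambayes-Classification"
ALLOWED = ("ham", "spam", "unsure")


def annotate(raw_message, classification):
    """Return raw_message with exactly one classification header.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch.
    """
    if classification not in ALLOWED:
        raise ValueError("unknown classification: %r" % (classification,))
    msg = message_from_string(raw_message)
    # R8: remove every existing classification header first, so the
    # message never carries more than one. Deleting a header that is
    # absent is a no-op.
    del msg[HEADER]
    # R6: one header line carrying a single value.
    msg[HEADER] = classification
    # R7: echo the whole annotated message.
    return msg.as_string()
```

Re-annotating an already annotated message replaces the old header
instead of stacking a second one, which is what R8 asks for.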
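
For R17 through R19, one common way to get recoverability is to write
the new database to a temporary file and then rename it over the old
one, so an interrupted update leaves the previous file intact. The
sketch below assumes a pickle-based store and Python 3; the names
save_atomically and load are hypothetical, and nothing here is meant
to settle the pickle versus DB question.

```python
import os
import pickle
import tempfile


def save_atomically(db, path):
    """Write db to path so that a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(db, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # replaces the old file in one step
    except BaseException:
        # Clean up the partial temporary file; the old database at
        # path has not been touched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load(path):
    """Read back a database written by save_atomically."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Note that this rewrites the whole file on every save, which sits
uneasily with R16 once the database gets large; a real DB would
update incrementally instead.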
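
As a rough illustration of R3, R20 and R21, chi-square combining of
per-token spam probabilities can be sketched as below. This is a
sketch under stated assumptions, not the project's code: the function
names and the 0.2 and 0.9 cutoffs are chosen for the example, token
probabilities are assumed to lie strictly between 0 and 1, and no
care is taken over floating-point underflow on very long messages.

```python
import math


def chi2q(x2, v):
    """P(chi-square with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def combine(probs):
    """Combine token spam probabilities into a score between 0 and 1."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # s is near 1 when the tokens collectively look spammy,
    # h is near 1 when they collectively look hammy.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


def classify(probs, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map the combined score to ham/spam/unsure (cutoffs are assumptions)."""
    score = combine(probs)
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

Strong evidence in both directions cancels out and lands in the
middle of the range, which is where the unsure classification of R3
comes from.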
- Alex