[Spambayes] Re: Guidance re pickles versus DB for Outlook

Tue Nov 26 21:42:29 2002

In message:  <w53el9882i2.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>So then, jeremy@alum.mit.edu (Jeremy Hylton) is all like:
>
>> I'm a big fan of unittests.  We should probably develop some.
>
>I couldn't agree more.

Err, before we start writing unit tests, shouldn't we have some
specifications on what everything is actually supposed to do?
While we're at it, it would probably be good to gather requirements
and lay out interfaces for the various elements...

Here's a first cut at what I see as requirements for the code:

R1. The code should be written in a single language, as far as
    possible.  For historical reasons, this language is probably
    going to be Python version 2.2.1.

R2. The code (in training mode) should accept email messages and
    their classifications as inputs, and record relevant data
    for later classification.

R3. The code (in classification mode) should accept email messages
    as input and offer trinary (ham/spam/unsure) classification as
    output.

R4. Raw RFC 822 messages should be acceptable as input email messages.

R5. For the MS users, the botch that Outlook turns messages into should
    be acceptable as input email messages.  Despite my opinion of it. ;-)

R6. Classifications should be output as a line of the form:
    X-Spambayes-Classification: ham/spam/unsure
    where exactly one of ham, spam, or unsure is present (the
    slashes above only separate the alternatives).

R7. Classifications may be added to the headers of a raw RFC 822
    message, in which case the whole (annotated) message should be
    echoed as output.

R8. In no case should more than one classification be present in
    a message.

R9. If a message provided as training input has already been trained
    with a classification, then it should be untrained from the old
    classification before training with the new classification.

R10. There should be a classifier front-end usable as a procmail filter.

R11. There should be a classifier front-end usable as a POP3 proxy.

R12. There should be a classifier front-end usable as an Outlook plugin.

R13. There should be training front-ends appropriate to each of the
     classifier front-ends.

R14. The classifier front-ends should use a common internal classifier
     module/class/whatnot to do all work not specifically related to
     managing input and output.

R15. The training front-ends should also use a common internal training
     module/class/whatnot to do all work not specifically related to
     managing input and output.

R16. There should not be any costly 'process all the data' operation
     associated with either training or classification.

R17. The internal database for the knowledge gleaned from training
     should be stored in persistent form between invocations.

R18. Changes to the internal database should be reflected in the
     persistent store in a timely manner.

R19. Changes to the persistent representation of the database should
     be done with an eye towards recoverability of the data in the
     case that a power outage (or similar catastrophic event)
     interrupts the update.

R20. The classification method should consist of some combination
     scheme applied to ham/spam probabilities associated with tokens
     derived from parsing the email messages.

R21. The chi-square combination method should be an allowed combination
     scheme.

R22. The gary-combining method may be an allowed combination scheme.

R23. The modified Graham probability computation (without biases,
     with Bayesian adjustment, etc.) should be an allowed probability
     computation scheme.

R24. The tokenization scheme should be based on the recent spambayes
     tokenizer code, which I don't feel like describing in sufficient
     detail at this time.
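
To make R3, R20 and R21 a bit more concrete, here is a minimal sketch of
chi-square combining, written as Fisher's method of combining
probabilities.  It is an illustration under assumptions, not the project's
actual code: the function names are hypothetical, the cutoffs 0.2 and 0.9
are placeholder values, the per-token spam probabilities are assumed to
lie strictly between 0 and 1, and it is written for a current Python 3
rather than 2.2.1.

```python
import math


def chi2q(x2, v):
    """Return P(chi-square with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)


def chi2_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Assumption: every probability is strictly between 0 and 1,
    otherwise math.log would fail.
    """
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # Evidence for spam: large when the probabilities are close to 1.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: large when the probabilities are close to 0.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


def classify(probs, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map the combined score onto the ham/spam/unsure output of R3.

    The cutoff values are illustrative assumptions only.
    """
    score = chi2_combine(probs)
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"
```

When the evidence is balanced, or when strong ham and strong spam clues
cancel each other, the score lands near 0.5, which is what makes a
separate "unsure" answer fall out naturally.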

etc...
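
One way to approach R17 through R19 when the store is a pickle is to
write the new copy to a temporary file, flush it to disk, and only then
rename it over the old one.  The sketch below shows that idea.  The names
save_atomically and load are hypothetical, the training data is assumed
to be a picklable object, and os.replace requires a current Python 3, so
this is not something that would run on 2.2.1 as written.

```python
import os
import pickle
import tempfile


def save_atomically(obj, path):
    """Write obj to path so that a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap in the new file in a single step
    except BaseException:
        # Clean up the partial temp file; the old database is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load(path):
    """Read back an object written by save_atomically."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Note that this rewrites the whole file on every save, which sits
uneasily with R16 once the database gets large; a real DB that updates
records in place trades that cost for more complicated recovery.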

- Alex