[spambayes-dev] RE: [Spambayes] How low can you go?

Thu Dec 25 17:10:24 EST 2003

[Seth Goodman]
> ...
> If I can have the Outlook binary and non-Outlook source working at
> the same time,

Probably, but I don't really know.  I run the Outlook addin directly from a
CVS checkout of spambayes, and have never used the binary installer (I don't
object to it <wink>, it's just that using it would consume a little more
non-existent "spare time").

> is there a way to convert my saved Outlook mail folders to mbox
> format

export.py in the spambayes Outlook2000 directory works fine, and I just
checked in a pile of changes so it works even finer.

> so that I _can_ see how the changes I make work on my own mail
> stream as well?

That's potentially more difficult than what you've (or I've!) been doing:
to run "what if I changed this or that?" experiments, you need to save every
email you ever get, and ensure that each one is correctly classified.  Else
you're not reproducing your original email stream, so it's anyone's guess
then what you'd really be testing.

Two days ago I created a new .pst file, with two folders "All ham" and "All
spam".  Since then I've been copying each message I get into one of them.
When it comes time to use export.py, I'll have to temporarily fiddle my
spambayes config to say that "All ham" is my (only) ham folder and "All
spam" my (only) spam folder (export.py gets its idea of where your ham and
spam training data are from your Outlook spambayes config file).
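
As an illustration of why every message needs a label, here is a minimal sketch of rebuilding one date-ordered, labeled stream from the exported mail. It assumes the export ends up as one message per file under a ham directory and a spam directory; that layout, and the names `labeled_stream` and `message_timestamp`, are hypothetical, so adjust them to whatever export.py actually writes on your system.

```python
import email
from email.utils import mktime_tz, parsedate_tz
from pathlib import Path


def message_timestamp(msg):
    """Return the Date header as a UTC timestamp, or None if it is unusable."""
    parsed = parsedate_tz(str(msg.get("Date", "")))
    if parsed is None:
        return None
    try:
        return mktime_tz(parsed)
    except (OverflowError, ValueError):
        return None


def labeled_stream(ham_dir, spam_dir):
    """Merge two directories of message files into one date-ordered list.

    Each entry is (timestamp, label, filename).  The one-message-per-file
    layout is an assumption, not something export.py is known to guarantee.
    """
    stream = []
    for label, folder in (("ham", ham_dir), ("spam", spam_dir)):
        for path in sorted(Path(folder).iterdir()):
            if not path.is_file():
                continue
            with path.open("rb") as fh:
                msg = email.message_from_binary_file(fh)
            stream.append((message_timestamp(msg), label, path.name))
    # Messages without a usable Date header sort first, the rest by time.
    stream.sort(key=lambda item: (item[0] is not None, item[0] or 0))
    return stream
```

A message missing from either directory simply never appears in the stream, which is the gap a replay experiment cannot detect afterwards.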

Copying all incoming msgs is a bit of a PITA for me, and if you also use
Outlook rules (I don't) to sort ham into different folders, it may be a royal
PITA.
So it goes -- Outlook wasn't designed for running spam-filter experiments
(then again, no email client was, and that's why we have a "standard"
test-data directory structure of our own).
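
For readers who have not seen that kind of layout, a rough sketch of spreading message files across numbered sets follows. The directory names used here (`Data/Ham/Set1` ... `SetN`, `Data/Spam/Set1` ... `SetN`) are an assumption for illustration, as are the helper name `split_into_sets` and the round-robin assignment; check the project's own README for the authoritative layout.

```python
import shutil
from pathlib import Path


def split_into_sets(source_dir, data_root, kind, nsets):
    """Copy message files from source_dir into data_root/<kind>/Set1..SetN.

    kind is "Ham" or "Spam".  Files are dealt out round-robin in sorted
    order; that choice, like the directory names, is an assumption.
    Returns the list of destination paths written.
    """
    if kind not in ("Ham", "Spam"):
        raise ValueError("kind must be 'Ham' or 'Spam'")
    if nsets < 1:
        raise ValueError("nsets must be at least 1")
    files = sorted(p for p in Path(source_dir).iterdir() if p.is_file())
    written = []
    for index, src in enumerate(files):
        set_dir = Path(data_root) / kind / f"Set{index % nsets + 1}"
        set_dir.mkdir(parents=True, exist_ok=True)
        dest = set_dir / src.name
        shutil.copyfile(src, dest)
        written.append(dest)
    return written
```

Keeping ham and spam in parallel numbered sets is what lets a test driver train on some sets and score the others without ever consulting the mail client.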

Ah, I've noted before that I throw away half my Unsures unclassified,
because I can't tell whether they're ham or spam (these are usually barely
intelligible msgs addressed to public "admin" or "help" kinds of addresses).
I'm making an arbitrary guess about each of those too, and saving a copy in
"All ham" or "All spam".  I *expect* a relatively high Unsure rate because
of this aspect of my email mix.  No part of the testing framework can be
talked into believing that Unsure is the *desired* outcome for a msg,
though, so I either have to make a guess about each, or damage the
experimental setup in unknown ways by not saving *all* my email.
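
To make the constraint concrete, here is a minimal sketch of how a test tally treats scores when the ground truth must be either ham or spam. The cutoff values (0.20 and 0.90), the boundary conventions, and the names `verdict` and `tally` are assumptions for illustration, not the testing framework's actual code.

```python
from collections import Counter

HAM_CUTOFF = 0.20   # assumed value: scores below this are called ham
SPAM_CUTOFF = 0.90  # assumed value: scores at or above this are called spam


def verdict(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam score in [0, 1] to 'ham', 'spam', or 'unsure'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def tally(results):
    """Count outcomes for an iterable of (true_label, score) pairs.

    The true label must be 'ham' or 'spam'; 'unsure' cannot be a truth
    value, so a message scored as unsure is never counted as correct.
    """
    counts = Counter()
    for truth, score in results:
        if truth not in ("ham", "spam"):
            raise ValueError("true label must be 'ham' or 'spam'")
        outcome = verdict(score)
        if outcome == "unsure":
            counts["unsure_" + truth] += 1
        elif outcome == truth:
            counts["correct"] += 1
        elif truth == "ham":
            counts["false_positive"] += 1
        else:
            counts["false_negative"] += 1
    return counts
```

Under this scheme a genuinely ambiguous message always lands in an `unsure_ham` or `unsure_spam` bucket according to the guess made when it was saved, never in a bucket that says Unsure was the right answer.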