[Spambayes] Moving closer to Gary's ideal

Guido van Rossum guido@python.org
Sun, 22 Sep 2002 23:27:30 -0400


> > So to match the 40 fp's from Graham's scheme, I'd need to set the
> > cutoff to 0.60; that would give me 41 fp's here (16+10+3+1+1+2+8).
> 
> Right!  BTW, that's a huge test run -- certainly the largest to date.
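
The cutoff arithmetic quoted above can be sketched in a few lines of Python. This is only an illustration, not the tester's actual code: the function names, the 0.05 grid step, and the sample scores are all made up, and it assumes each ham message has a score between 0 and 1 and counts as a false positive when it scores at or above the cutoff.

```python
# Bucket counts quoted in the message; they add up to the 41 fp's mentioned.
quoted_buckets = [16, 10, 3, 1, 1, 2, 8]
total_fp = sum(quoted_buckets)

def false_positives(ham_scores, cutoff):
    """Count ham messages scored at or above the cutoff (assumed fp rule)."""
    return sum(1 for s in ham_scores if s >= cutoff)

def lowest_cutoff_within(ham_scores, max_fp, step=0.05):
    """Return the lowest cutoff on a fixed grid giving at most max_fp
    false positives, or None if no grid point qualifies."""
    n = int(round(1 / step))
    for i in range(n + 1):
        cutoff = i * step
        if false_positives(ham_scores, cutoff) <= max_fp:
            return cutoff
    return None

# Hypothetical ham scores, purely for illustration.
sample_ham = [0.10, 0.20, 0.62, 0.72, 0.97]
print(total_fp)
print(false_positives(sample_ham, 0.60))
print(lowest_cutoff_within(sample_ham, 2))
```

Scanning the grid from the low end returns the most aggressive cutoff that still stays within the fp budget, which is the comparison being made against Graham's 40 fp's.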

I finally looked at my fp's in detail, which I should have done right
away.  There were 4 empty files (MH refiling accidents probably) and
11 spams that I had somehow saved.  I'm rerunning everything with
these removed now.

Among the remaining fp's were a few forwarded spams (several by you,
Tim!); a bunch of automated responses from web sites where I ordered
stuff; a bunch of product newsletters that I like to get because I use
the product; two pieces of legit email (largely) in Spanish; a
legitimate job posting sent to jobs@python.org; a list of diet tips
forwarded by my wife; a message from a publisher asking where to send
my check (I'd hate to have misfiled that one!); a bunch of very short
messages (some with, some without ISP-added trailers); a few messages
with HTML alternatives; and one question about HTML style sheets that
quoted a few lines of typical stylesheet gibberish (not HTML, but this
stuff is often inlined in HTML).  One of the brief questions used
charset=GB2312 (whatever that is); there were a lot of bogus 8-bit
characters, but in the middle it said loud and clear 'Where can I get
a "wxpython"?'.

> > I'm going to have to look at the fp's and fn's to see if there are
> > real spams hiding in the ham, and vice versa.
> 
> If there aren't, you'll be the first tester ever not to discover
> some.

See above. :-)

> For example, I've found 3 hams in BruceG's spam collection so
> far, and I believe you're using that too (but much more of it).

I've not found the courage to look at the hundreds of fn's.

Suggestion: rather than showing the content of the fn's and fp's (the
filenames are enough for me), would it be possible to show the
filenames corresponding to the outliers in the ham/spam distributions?
E.g. there's 1 message in my spam collection that scores 37.50
according to the overall histogram.  How to find that one?
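
A minimal sketch of what such a listing might look like, assuming the tester can hand back (filename, score) pairs for each collection. The function name, the filenames, and the scores here are hypothetical, and the 0-100 scale simply follows the 37.50 figure above; none of this is the existing tester's interface.

```python
def extremes(scored, n=5, lowest=True):
    """Return the n (score, filename) pairs at one end of the distribution.

    scored is an iterable of (filename, score) pairs.  With lowest=True
    this yields the lowest scores (suspicious entries in a spam
    collection); with lowest=False the highest (suspicious ham).
    """
    ranked = sorted(((score, name) for name, score in scored),
                    reverse=not lowest)
    return ranked[:n]

# Hypothetical spam collection with one low-scoring outlier.
spam_scored = [
    ("spam/0001.txt", 99.0),
    ("spam/0002.txt", 37.5),
    ("spam/0003.txt", 100.0),
]

for score, name in extremes(spam_scored, n=1):
    print("%6.2f  %s" % (score, name))
```

Printing only the score and the filename keeps the output short, and the same function covers both ends: the lowest-scoring spam and, with lowest=False, the highest-scoring ham.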

> BTW, it's my belief that this all works *best* if the ratio of ham
> to spam trained on matches your real-life inbox ratio.

That's impossible to know in my case.  Almost all of my mail goes
through the SpamAssassin setup at python.org, which throws all spam
away.  As a result I see maybe 1 spam for every 50 hams -- but that's
not the spam/ham ratio seen by the MTA for guido@python.org.

> > I did notice that many fp's were very spammish automated postings
> > that I have specifically signed up for, like our building's
> > announcements, product newsletters, and so on.  I haven't looked at
> > the fn's.
> 
> I expect these are your moral equivalents to the conference
> announcements in my c.l.py ham, except worse.  However, I expect you
> have more cause for optimism about those: you (like me) are running
> a crippled version of the algorithm because of your mixed-source
> corpora.  The headers we're ignoring are bound to have strong clues
> about the *senders* of the spammish stuff you've signed up for.

Only if I saved enough of these, right?  Any clue as to what option to
try?

> It would be nicer to display them side-by-side, but you'd have to learn how
> to use more of your available screen area then <wink>.

Jeremy would suggest generating gnuplot input so we can draw them in
multiple colors. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)