[Spambayes] Leaving for another tool.

Mon Dec 10 16:04:19 CET 2007

Sounds like there's a research paper in here somewhere, should anyone
want one. I suspect that I'm one of a large majority of contented
SpamBayes users, but that's hard to know. I'm quite happy with the
filtering obtained by training SpamBayes using what seems to me an easy
and intuitive approach: if SpamBayes puts a message in the wrong place,
I move it where it belongs.

But there are some number of people who get results that leave them
frustrated and annoyed. It would be interesting to know why that is.
Different data? Different training methods? Different expectations?
Something else altogether?

Maybe it's an indictment of the open source process that no one has
answered these questions. There's no one to commission the research,
just a few good souls who had a need, saw a possible solution, scratched
their own itch, and kindly made the resulting software available for the
rest of us. On the other hand, maybe it's just a matter of time, and
Pete or Thomas or someone else will have the interest and resources to
puzzle out an explanation for the wide range of experiences.

________________________________

From: spambayes-bounces at python.org [mailto:spambayes-bounces at python.org]
On Behalf Of Audiography Support
Sent: Saturday, December 08, 2007 7:08 PM
Cc: spambayes at python.org
Subject: Re: [Spambayes] Leaving for another tool.

Skip, I understand your concern - too much data is sometimes worse than
not enough. And I'm not a mathematician by any stretch of the
imagination. Programmer, yes, hardware designer, yes, PC guru, yes (in
another life), Geek of the week 5 times running, yes, but not
mathematician.

But if I start using a tool that looks for certain words, patterns,
phrases, and so on in messages in order to identify similar messages in
the future, then it's really counter-intuitive to say "don't train it so
much, you'll break it." Why not?

The problem I've run into (every day now) is that no matter how much or
little training I use, SB/TB doesn't work the way everyone tells me a
bayesian filter is supposed to. And when I try to give examples, people
say "oh, that's not the way it works!". What way does it work, then? I'm
either training it too much, or not enough!

If I look at the scores given by Spambayes and Thunderbayes to words
like penis, viagra, etc, they are ridiculously low compared to words
like "sell" or "friend" or "the" - when those 'bad' words don't occur -
ever - in good messages! Never, ever, ever. So the technology should be
able to look at those words and say "hey, this word only ever appears in
bad messages, so I'm going to weight it like hell and mark it as bad!",
without me needing to train it. That didn't happen.

By way of example (and this probably says as much about my lack of
mathematical knowledge as anything else!):

I had a "penis" list of messages. These messages all contained the word
"penis". I had 1,000 of these messages, which typically consisted of
20-150 words each, with a lexical word base of about 150 words. Then, I
had a list of identical messages, with the word "penis" taken out. So I
reset SB (completely), then trained it on the bad messages as spam, and
the "edited" messages as ham. When I fed it a new message, identical to
another of the previously trained bad messages, it scored it as 8%. How
in the name of the Oxford Abridged Dictionary can that be calculated as
right?

I admit this isn't the way SB should be used. But if it can't
distinguish such a fundamentally simple concept - out of 150 words, 149
are common, and the one "uncommon" word - which appeared 1,000 times
only in bad messages - was weighted at about 1 chance in 16 of being
"bad"? And the word "yesterday" was weighted at 22%. That's the kind of
number that makes me say "this tool doesn't work". More training doesn't
fix the problems, but I thought it would. That's my mistake!
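
For comparison, here is a minimal sketch of the textbook per-token
calculation for the experiment described above. It assumes Robinson-style
smoothing with a prior strength s = 0.45 and a prior probability x = 0.5
(believed to match the SpamBayes defaults, but treat both as assumptions).
The function name is hypothetical and this is not the SpamBayes code itself.
It only shows what the arithmetic would suggest for a token seen in 1,000
spam and no ham, versus a token seen equally in both.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Robinson-style smoothed spam probability for a single token.

    spam_count / ham_count: number of trained spam / ham messages
    containing the token. nspam / nham: total trained spam / ham.
    s and x are assumed prior strength and prior probability.
    """
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    p = spam_ratio / (spam_ratio + ham_ratio)  # raw estimate
    n = spam_count + ham_count                 # evidence for this token
    return (s * x + n * p) / (s + n)           # shrink toward the prior


# The experiment above: 1,000 spam and 1,000 ham messages trained.
only_in_spam = token_spam_prob(1000, 0, 1000, 1000)    # e.g. "penis"
in_both = token_spam_prob(1000, 1000, 1000, 1000)      # a shared word
print(only_in_spam, in_both)
```

Under these assumptions the spam-only token scores very close to 1 and the
shared token scores 0.5, so a weight of roughly 1 in 16 would point to
something other than this per-token formula (headers, tokenization, or the
state of the training database).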

OK, there are other considerations, such as the mail headers, character
formatting, misspellings, character sets, and so on - but there is no
easy (or difficult, for that matter) way to configure ANY bayesian filter
I've EVER seen to work the way I need it. I can't seem to tell it to
ignore headers - and I know for a fact that header data is taken into
account when training - but since so much of the header data (taken as
individual words) is identical, where's the tool to let me tell SB not
to score "x-Mozilla-Status" or "Envelope-to" in every single damn piece
of email I get? But if I look at the training dbs (assuming I can find
them and see what's in them), I find those same terms are
weighted/scored just as high as "bad" words!
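
As an illustration of the kind of "ignore headers" behaviour asked for
here, the following is a rough sketch that tokenizes only the body of a
message using Python's standard email package. The function name
body_tokens and the token pattern are hypothetical; this is not an existing
SpamBayes or ThunderBayes option, just one way the idea could look.

```python
import email
import re


def body_tokens(raw_message):
    """Return lowercase word tokens from the text parts of a message,
    ignoring every header line (hypothetical helper)."""
    msg = email.message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the message
            texts.append(payload.decode("utf-8", errors="replace"))
    return re.findall(r"[a-z0-9$'-]+", "\n".join(texts).lower())


raw = (
    "Envelope-to: me@example.com\n"
    "X-Mozilla-Status: 0001\n"
    "Subject: hi\n"
    "\n"
    "Cheap pills today\n"
)
tokens = body_tokens(raw)
print(tokens)
```

The trade-off is that header clues which do separate spam from ham (odd
mailers, forged routing) would be thrown away along with the headers that
appear in every message.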

So what I'm narked off about (and don't misunderstand me, I'm
absolutely frustrated as hell at wasting close to 100 hours with
installing, training, resetting, uninstalling, reinstalling, retraining,
and then rescuing email from the junk pile and simultaneously manually
putting spam into the training folder or the junk folder, and repeat) is
the fact that the technology doesn't work the way it's supposed to. Not
for me, anyway. And I'm a poster boy for nerds, believe me.

Maybe the problem is SB is trying to be all things to all users. But
what I've learned from the last 5 months is that the SB and TB tools are
not easy or friendly or configurable. They're difficult to get going,
difficult to maintain, unable to be truly configured to specific needs
(that are actually common across every smtp/pop email client on every
OS), and they don't like not being trained almost as much as they hate
being overtrained!

The one time I tried to "get into" a spam table (after exporting it to
XML and re-importing it with some words weighted more heavily), it
completely broke TB - but not because of the weighting I used, it was
because the XML import filter doesn't actually import valid XML, and the
export XML filter doesn't export valid XML.
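
A quick way to check whether an exported or hand-edited table is at least
well-formed XML before re-importing it is sketched below, using Python's
standard xml.etree.ElementTree parser. The helper name is hypothetical and
nothing here assumes the actual TB export schema; it only tests whether
the text parses.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<tokens><token>sell</token></tokens>"))
print(is_well_formed("<tokens><token>sell</tokens>"))
```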

Yep, I'm sticking to what works for me until I have enough free time to
try and add something to the bayes community. I really do want to try
and improve the bayes filter technology, because it should work better.

Thomas Hruska wrote: 

	skip at pobox.com wrote: 

		    Pete> Way too many false negatives (still running at around 7%, after
		    Pete> 13,000+ spam training messages and 50,000+ good training
		    Pete> messages),
		Way too large a database.  Train on just mistakes and unsures.  If
		you've trained on over 60,000 messages you must be training on
		everything you receive.

		Good luck with K9.  Sounds like it's doing the trick. 

		Skip 

	This only proves that Spambayes needs an autobalancing ham/spam
feature built in by default.  Users train on everything in the hopes of
eliminating all spam from the in-box.  Also, by _your_ logic, the
default training mechanism in Spambayes should be to NOT train on spam.
VERY counter-intuitive. 

	In terms of usability, Spambayes is clearly designed "by geeks,
for geeks" but since this tool has appeared in major computing magazines
that _users_ read, the tool needs to change to fit the mindset of those
who will actually use the product.  What Pete said has crossed my mind
quite frequently while using the tool.  Your focus is on "training
database size" rather than the user's actual complaint:  That the
product is not _usable_.  I can use and understand the product only
because I'm a geek.  However, it needs to be significantly simplified so
a user can use it. 

	Maybe your goal is only to cater to geeks.  If that's the case,
you need to state it somewhere at the top of your homepage and drop the
support for the Outlook add-in - at which point I too will probably stop
using the tool because there is no hope for it...ever.  Users will not
take the time to learn to use the tool how you want them to use it.  If
they see spam, they are going to train it as spam no matter how large
their training database gets.  That is how users think.  Developing
software is more about psychology than code:  Study the user and code
accordingly. 

	Sorry for the rant.  I've been feeling the same way as Pete and
wanted to put what he said into a little different perspective - perhaps
one that you'd understand better.  You completely ignored Pete's very
lengthy e-mail on what it means to be a user of Spambayes from his
perspective and instantly focused on the one sentence that is useless to
him but is "comfortable" for you.  I'm hoping this helps craft an
improved tool rather than write Pete, me, and other users like us off as
"annoying".  I know that you are going to be upset when you read this
but I don't care if you hate me as long as you end up going back and
pondering Pete's e-mail.  His words, from just that one e-mail, are
capable of guiding how Spambayes should be developed for the next 5
years. 
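
For readers following the thread, Skip's advice quoted above ("Train on
just mistakes and unsures") could be sketched roughly as follows. The
cutoffs 0.20 and 0.90 are assumed to be the usual SpamBayes ham and spam
thresholds, and should_train is a hypothetical helper, not part of the
SpamBayes API.

```python
HAM_CUTOFF = 0.20   # assumed: scores below this are treated as ham
SPAM_CUTOFF = 0.90  # assumed: scores at or above this are treated as spam


def should_train(score, is_spam):
    """Decide whether a message is worth training on.

    score is the filter's spam score in [0, 1]; is_spam is the user's
    verdict. Train only on unsures and on outright mistakes.
    """
    if HAM_CUTOFF <= score < SPAM_CUTOFF:
        return True  # unsure: the filter needs the hint
    predicted_spam = score >= SPAM_CUTOFF
    return predicted_spam != is_spam  # mistake: prediction was wrong


print(should_train(0.50, True))   # unsure
print(should_train(0.99, True))   # correctly caught spam
print(should_train(0.05, True))   # missed spam
```

A policy like this keeps the database small automatically, which is
closer to what Thomas asks for than expecting users to hold back by hand.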

-- 

Peter Naus

Audio Engineering Manager

Audiography

20 Churinga Avenue, Mitcham. Victoria. 3132. Australia.

Freecall: 1300 78 4576

Phone: +613 8802-4562

e-mail: support at audiography.com.au

web: http://www.audiography.com.au
