[Spambayes] progress on POP+VM+ZODB deployment

Tim Peters tim.one@comcast.net
Sun Oct 27 23:43:18 2002


[Tim]
>> If one user signs up for a minister-by-mail scam (a real-life example
>> reported earlier on this list), then all users get minister-by-mail
>> scams.  Etc.

[Derek Simkowiak]
> 	I'm a little slow, so forgive me if this is... repetitive.  But
> your argument sounds like something of a showstopper to my intended
> use of SpamBayes, and I want to make sure this behaviour is clearly
> documented in the archives.

Arguments don't count for much here:  you can set up a test and measure
results.  That's the only way to know.  I've told you my best guess, but
guesses here are often wrong.

> 	Consider a group of people who all use the same mail server.
> I'm thinking of a university, or customers of one of those $20/month
> email services, or a 1000-person company.
>
> 	Now consider the sysadmin who wants to use SpamBayes for the
> purpose of flagging spam on that mail server, such that users can set
> up a generic filter rule that is easily supported by the organization's
> Help Desk.

It's not an application I've got in mind, and not one that I've tested or
intend to test.  Other people here are interested in this, but they don't
appear to be around today.

> 	The way I understand it, if any _one_ person in the group of
> people likes to get advertisements, porn mails, hotel conference info,
> and/or minister-by-mail, and SpamBayes is trained on all incoming mail,
> then everybody in the group will have their filtering rendered useless.

Minus the hyperbole, yes, unless you've done whatever it takes to inject
some recipient-specific smarts.  If it passes on porn spam to me, how could
it possibly block it for you otherwise?  For a start, it would have to know
that you and I are different.  And that's got nothing to do with Bayesian
filters, or any other technicality:  if different people call different
things spam (and they do -- that's a fact), and any scheme that doesn't know
the difference between people necessarily treats all people the same (that
sure *seems* to be a fact <wink>), then if it lets my porn spam through then
you get it too, or if it blocks my porn spam for you then it blocks it for
me too.  Either way one of us is left unhappy.

> 	In other words, Bayesian filtering (as popularized by the article
> "A Plan for Spam") is only good for individuals, or small groups of
> individuals who all like the same kinds of ham.

I think that's too extreme a conclusion.  For example, python.org serves up
tech lists for tens of thousands of users, and we have strong evidence that
a single classifier will work fine there.  Tech lists have a *shared* notion
of what's spam, though.

> 	I can't help but feel that I'm missing something.  In this
> setting, it seems like training on hams is quite destructive to the goal
> of flagging Spam.

The algorithm doesn't try to flag spam; it tries to *separate* ham from spam,
and the characteristics of both populations feed into that.  We've come full
circle, and I'll repeat that SpamAssassin may be more to your liking.  It
does try to flag spam largely independent of any notion of ham, although
from what I've seen of SpamAssassin admins, they spend a lot of time
crafting "positive rules" to try to let through things *their* site
considers to be ham.  Whitelists seem very effective for that, and I expect
some form of whitelist would help a large deployment of the spambayes code
too.  OTOH, different people also want different whitelists.
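A per-recipient whitelist is easy to sketch.  This is a hypothetical
illustration, not spambayes code; the addresses, domain names, and function
name are all made up:

```python
# Map each recipient to the sender domains they've opted in to.
# In a real deployment this table would live wherever the mail
# server keeps per-user configuration.
whitelists = {
    "alice@example.com": {"minister-by-mail.example.net"},
    "bob@example.com": set(),
}

def is_whitelisted(recipient, sender_domain):
    """Deliver unconditionally if this recipient opted in to the sender.

    Unknown recipients get an empty whitelist, so nothing is
    whitelisted for them by default.
    """
    return sender_domain in whitelists.get(recipient, set())
```

The point of keying the table by recipient is exactly the OTOH above:  what
Alice wants let through, Bob may still call spam.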

> 	What if we pretend that all hams have exactly .5 probability, that
> is, any given ham cannot be identified as either being a spam, or not
> being a spam.  That is, all hams are just random noise.
>
> 	Then we train against a huge collection of spam, like Bruce G.'s
> stuff.
>
> 	Each word in the database gets a "spam likelihood" rating,
> depending on what percentage of the time it shows up in the spams.  A word
> that shows up in every single spam gets a "1.0", and every word that does
> not appear in the spam at all gets a "0.0".

I don't know, and it doesn't seem to make sense in the statistical framework
the spambayes project is built around.  You could test it, though, by
fiddling our codebase.  For example, replace update_probabilities like so:

    def update_probabilities(self):
        """Update the word probabilities in the spam database.

        This computes a new probability for every word in the database,
        so can be expensive.  learn() and unlearn() update the probabilities
        each time by default.  They have an optional argument that lets you
        skip this step when feeding in many messages, and in that case
        you should call update_probabilities() after feeding the last
        message and before calling spamprob().
        """

        nspam = float(self.nspam or 1)

        S = options.robinson_probability_s
        StimesX = S * options.robinson_probability_x

        for word, record in self.wordinfo.iteritems():
            spamcount = record.spamcount
            assert spamcount <= nspam
            prob = spamcount / nspam

            # Now do Robinson's Bayesian adjustment.
            # ...
            prob = (StimesX + spamcount * prob) / (S + spamcount)

            if record.spamprob != prob:
                record.spamprob = prob
                self.wordinfo[word] = record

BTW, there's no need to train on ham at all then (doing so would have no
effect on computed spamprobs).

> We throw out ueber-common words like a, and, the, it, just like Google
> does for its searches, as a matter of efficiency.

It's not really a matter of efficiency; it's more that since "a" appears in
virtually every spam *and* ham, the spamprob of "a" will be approximately
1.0 if you ignore hamcounts (it's approximately 0.5 now).  Note too that any
ham that just happens to mention "money" will then contain a very
high-spamprob word.  Words you used in this email:

'filtering'                    0.844828
'plead'                        0.844828
'is...'                        0.844828
'company.'                     0.895746
'spam.'                        0.899585
'scam'                         0.908163
'like.'                        0.934783
'flagging'                     0.958716
'rated'                        0.983271
'porn'                         0.988998

will have even higher spamprobs than those, because there will be no
hamcounts to counteract them.  Indeed, all words will have higher spamprobs
than they have now.
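You can see the difference in a few lines of arithmetic.  The function names
here are made up for illustration; the ham-aware formula is the basic
relative-frequency estimate, before Robinson's adjustment is applied on top:

```python
def spam_only_prob(spamcount, nspam):
    # The proposed scheme: ignore ham entirely and use the
    # fraction of spams containing the word.
    return spamcount / float(nspam)

def ham_aware_prob(spamcount, nspam, hamcount, nham):
    # Weigh the word's spam frequency against its ham frequency,
    # so words equally common in both come out neutral.
    spamratio = spamcount / float(nspam)
    hamratio = hamcount / float(nham)
    return spamratio / (spamratio + hamratio)

# "a" appears in virtually every message, ham and spam alike.
only = spam_only_prob(990, 1000)               # ~0.99: looks like spam evidence
aware = ham_aware_prob(990, 1000, 990, 1000)   # 0.5: correctly neutral
```

Without hamcounts in the denominator, there's nothing to pull a word's
probability back toward 0.5, so every word's estimate can only go up.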

> 	Then every email is rated word-by-word.  The scores for all the
> words are then averaged together.  So an email with many words commonly
> found in spam gets a high rating... (?)

Here's a spam I picked at random from my personal collection.  Which words
in it can you hope will get a high spam rating?

"""
Hi i read your profile and you live in my area.  Maybe we could chat on line
or even meet for a coffee. If you would like to come and chat with me
i will be on line most of the night
at http://www.designerlove.com/?rid=love2
My screen name is "PenPal"
Log in and i'll be in the chat section. Hope to see you soon.
"""

As a matter of fact, none of those words are *common* in spam, except for
words like "and", "the" and "on".  My classifier nails it anyway (score of
0.97), because while words like "chat" appear in a small percentage of my
spam, they appear in an even smaller percentage of my ham (peeking inside a
fat c.l.py classifier, 'chat' appeared in 20 of 18,000 hams, and 164 of
12,600 spams: it's rare by any measure; what matters here is that it's
*relatively* rarer in my ham than in my spam, and likewise for 'coffee.',
and so on).
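Plugging the 'chat' counts quoted above into the same relative-frequency
estimate makes the point concrete (again, this is the simple ratio, not the
exact spambayes computation, which adjusts it further):

```python
def word_spamprob(spamcount, nspam, hamcount, nham):
    """Estimate a word's spamprob from its per-message frequencies."""
    spamratio = spamcount / float(nspam)   # fraction of spams containing it
    hamratio = hamcount / float(nham)      # fraction of hams containing it
    return spamratio / (spamratio + hamratio)

# 'chat' from the fat c.l.py classifier: 164 of 12,600 spams,
# 20 of 18,000 hams.  Rare everywhere, but roughly 12x more
# frequent per spam than per ham, so its spamprob lands high
# (about 0.92).
prob = word_spamprob(164, 12600, 20, 18000)
```

A rare word can still be strong evidence:  what the scheme cares about is
the *ratio* of frequencies, not how often the word appears overall.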

> 	Um, I've overstepped my understanding of the problem, so I'll just
> stop there.  But to you algorithm geniuses, I plead for a way to filter
> spam that depends only on previously-seen Spam, and that does not depend
> on what ham looks like.

Why do you think you get so much spam?  One reason is that one-size-fits-all
schemes don't work well.  I'd like to plead for world peace too, while the
algorithm geniuses are at it <wink>.