[Spambayes] progress on POP+VM+ZODB deployment

T. Alexander Popiel popiel@wolfskeep.com
Mon Oct 28 18:11:24 2002


In message:  <Pine.LNX.4.33L2.0210280842260.30862-100000@dev.itsite.com>
             Derek Simkowiak <dereks@itsite.com> writes:
>
>	To summarize: I think it's the job of a spam filter (or "flagger")
>to identify those messages universally accepted as being spam -- whether or
>not any one person likes that kind of mail.

I'm reasonably sure there is no consensus on the definition of spam,
so the concept of 'universally accepted' spam is flawed at its root.
Some people restrict it to unsolicited commercial email; some consider
any marketing message to be spam.  Some don't care if it's commercial
or not.  Worse, for the lowest-common-denominator UCE definition,
knowledge of the individual users is required (whether they solicited
it or not).

As such, I'd say your ideal universal flagger concept is unrealizable.

Even if the concept is sound, I think that the classifiers we're working
with are a bad fit for your concept, since at their core they need to
know something about what's good as well as what's bad.  Otherwise, you
end up saying stuff is spam because it used the words 'you', 'there',
'some', 'the', etc... the incidentals of the language, with no real
bearing on the message.
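
To illustrate the point, here is a minimal sketch (not the actual
spambayes code) of a simplified Graham-style per-word spam probability.
The function name word_spam_probs and the toy messages are made up for
the example, and I'm assuming plain whitespace tokenizing with no
smoothing.  With no ham at all, every word seen in spam scores 1.0,
the incidentals included:

```python
from collections import Counter

def word_spam_probs(spam_msgs, ham_msgs):
    """Per-word spam probability from per-message word frequencies.

    Hypothetical, simplified Graham-style estimate; no smoothing.
    """
    # Count each word at most once per message.
    spam_counts = Counter(w for m in spam_msgs for w in set(m.lower().split()))
    ham_counts = Counter(w for m in ham_msgs for w in set(m.lower().split()))
    nspam = max(len(spam_msgs), 1)  # avoid dividing by zero on an empty corpus
    nham = max(len(ham_msgs), 1)
    probs = {}
    for w in set(spam_counts) | set(ham_counts):
        s = spam_counts[w] / nspam  # fraction of spam containing the word
        h = ham_counts[w] / nham    # fraction of ham containing the word
        probs[w] = s / (s + h)
    return probs

# Toy corpora, invented for illustration only.
spam = ["you can get the free pills", "the free offer is there for you"]
ham = ["can you review the patch", "the meeting is there at noon for you"]

no_ham = word_spam_probs(spam, [])      # trained on spam alone
with_ham = word_spam_probs(spam, ham)   # trained on both
```

Here no_ham["the"] and no_ham["you"] both come out at 1.0, the same as
no_ham["free"].  With the ham included, 'the' and 'you' fall back to a
neutral 0.5 while 'free' stays at 1.0, so only the ham lets the
classifier tell the telling words from the filler.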

>	I've seen many people on this list use Bruce's spam for their
>training.  But undoubtedly there is a message in his collection that would
>be of interest to at least *someone* on this list.  Does that invalidate
>his collection as being a spam training repository?

I have avoided using _any_ outside source of spam, precisely because
I don't trust their judgement on my mail.  If there's a classification
error, I want it to be traceable only to me, not to some other person's
potentially warped ideas about mail.  (Note that this is not to say
that I think Bruce's collection is bad or warped... I haven't looked
at it, so cannot say.  I'm just paranoid about my mail.)

>	I would say no, it does not, because his collection is of the type
>"universally accepted as spam".  That is the type of message I would like
>to see flagged at Universities, ISPs, and companies.
>
>	And to do that, I don't think ham training can be in the picture,
>since somebody's "ham" is another person's "spam", and training on
>people's "ham" can only weaken what is considered "universally accepted as
>spam".

I'll run some experiments (I've been doing the most work on ham:spam
ratios anyway),
anyway), but I suspect that without any ham the spambayes classifier
will fail horribly.

- Alex