[Spambayes] test sets?

Anthony Baxter anthony@interlink.com.au
Sat, 07 Sep 2002 14:00:36 +1000


>>> Tim Peters wrote
> > As well as the usual spam, it also has customers complaining about
> > credit card charges, it has people interested in the service and
> > asking questions about long distance rates, &c &c &c. Lots and lots
> > of "commercial" speech, in other words. Stuff that SA gets pretty
> > badly wrong.
> 
> Can this corpus be shared?  I suppose not.

Almost certainly 100% not, at least not without a massive massive
amount of manual cleansing. There's just too much personal data in
there.

> > I did have Received in there, but it's out for the moment, as it causes
> > rates to drop.
> That's ambiguous.  Accuracy rates or error rates, ham or spam rates?

It made both the f-p and f-n rates drop. I need to think a bit more
about why - I'm currently thinking about a special tokeniser just for
Received, so that, e.g., a hostname like
'pcp736393pcs.reston01.va.comcast.net' gets turned into

received:pcp736393pcs.reston01.va.comcast.net
received:reston01.va.comcast.net
received:va.comcast.net
received:comcast.net
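
A minimal sketch of that suffix expansion, assuming the hostname has
already been pulled out of the Received line. The function name and the
min_labels cut-off are made up for illustration; nothing like this is
in the tokeniser yet.

```python
def received_host_tokens(hostname, min_labels=2):
    """Yield one 'received:' token per domain suffix of hostname.

    Hypothetical helper: min_labels=2 stops at 'comcast.net' so that a
    bare top-level domain never becomes a token on its own.
    """
    labels = hostname.lower().strip(".").split(".")
    # Drop one leading label at a time, keeping at least min_labels.
    for start in range(len(labels) - min_labels + 1):
        yield "received:" + ".".join(labels[start:])


for token in received_host_tokens("pcp736393pcs.reston01.va.comcast.net"):
    print(token)
```

The idea is that the full hostname is nearly unique per message, while
the shorter suffixes are the ones likely to repeat across a corpus.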

Specialising the tokeniser for various headers actually seems to do
some good - in particular, keeping the Content-Type parameters and
their values makes for a good detector of Korean spam.
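
Roughly what keeping the parameters could look like with the standard
email package. This is a sketch rather than the actual tokeniser: the
"content-type:" token prefix and the function name are assumptions, and
the charset in the sample message is only a guess at the kind of
parameter that ends up being a useful clue.

```python
import email
from email.utils import collapse_rfc2231_value


def content_type_tokens(msg):
    """Yield tokens for the Content-Type of every MIME part of msg."""
    for part in msg.walk():
        yield "content-type:" + part.get_content_type()
        # get_params() returns (name, value) pairs; the first pair is
        # the type itself, so only the rest are real parameters.
        for name, value in part.get_params(failobj=[])[1:]:
            # RFC 2231 encoded values arrive as tuples; collapse them.
            value = collapse_rfc2231_value(value)
            yield "content-type:%s=%s" % (name.lower(), value.lower())


# Hypothetical sample message, not taken from any real corpus.
raw = 'Content-Type: text/html; charset="EUC-KR"\n\n<html></html>\n'
for token in content_type_tokens(email.message_from_string(raw)):
    print(token)
```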

> Mining embedded http/https/ftp thingies cut the false negative rate in half
> in my tests (not keying off href, just scanning for anything that "looked
> like" one); that was the single biggest f-n improvement I've seen.  It
> didn't change the false positive rate.  Do you know whether src added
> additional power, or did you do both at once?

Both at once. I added it because <iframe src=cid:foofoofoo> is such a
killer detector of spam/viruses, also because I got a bunch of email
spam that was just

    <img src="bozo.bozo.kr/img34532.jpg">
    <img src="bozo.bozo.kr/img34512.jpg">
    <img src="bozo.bozo.kr/img34237.jpg">
    <img src="bozo.bozo.kr/img34914.jpg">

I'm already stripping out HTML tags - with them in, it was producing
far too many false positives on my corpus. Without the src/hrefs
these spams were pretty much null and void.
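
For illustration, one way to pull out the src/href values before the
tags get thrown away. The regexes, the function name and the "src:" /
"href:" token prefixes are assumptions for this sketch; it is a crude
pattern match, not a real HTML parser.

```python
import re

# Value of a src= or href= attribute, quoted or not (hypothetical pattern).
ATTR_RE = re.compile(r"""\b(src|href)\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)
# Anything that looks like a tag.
TAG_RE = re.compile(r"<[^>]*>")


def tokenize_html(text):
    """Return src/href tokens plus the words left once tags are stripped."""
    tokens = ["%s:%s" % (attr.lower(), value.lower())
              for attr, value in ATTR_RE.findall(text)]
    # Replace each tag with a space, then split on whitespace.
    tokens.extend(TAG_RE.sub(" ", text).split())
    return tokens


print(tokenize_html('<img src="bozo.bozo.kr/img34532.jpg">'))
print(tokenize_html("<iframe src=cid:foofoofoo>"))
```

With the tags stripped and no attribute mining, both of those messages
would produce no tokens at all.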