[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Wed, 28 Aug 2002 16:59:39 -0400


[Paul Graham]
> Don't count words multiple times, and you'll probably
> get fewer false positives.  That's the main reason I
> don't do it-- because it magnifies the effect of some
> random word like water happening to have a big spam
> probability.

Yes, that makes sense, but I'm trained not to think <wink>.  Experiment will
decide it (although I *expect* it's a good change, and counting multiple
occurrences was obviously a factor in several of the rare false positives).
If spam really is different, it should be different in several distinct
ways.
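
To make the multiple-counting effect concrete, here is a minimal sketch. It assumes the Graham-style combining rule P/(P+Q), where P is the product of the per-word spam probabilities and Q is the product of their complements. The per-word probabilities and the token list are made up for illustration, and the sketch skips picking out only the most extreme words, so it is not the classifier under test:

```python
from math import prod  # Python 3.8+


def combined_spam_probability(probs):
    """Combine per-word spam probabilities as P / (P + Q)."""
    p = prod(probs)                      # product of spam probabilities
    q = prod(1.0 - x for x in probs)     # product of their complements
    return p / (p + q)


# Hypothetical per-word spam probabilities (illustrative values only).
word_prob = {"water": 0.9, "python": 0.1, "duck": 0.4}

# A message in which "water" happens to occur three times.
tokens = ["water", "water", "water", "python", "duck"]

# Count every occurrence: "water" contributes its 0.9 three times.
with_repeats = combined_spam_probability([word_prob[t] for t in tokens])

# Count each distinct word once: "water" contributes a single 0.9.
unique_only = combined_spam_probability(
    [word_prob[t] for t in sorted(set(tokens))]
)

print(f"counting repeats: {with_repeats:.4f}")
print(f"unique words only: {unique_only:.4f}")
```

With these made-up numbers the score is about 0.98 when repeats are counted and 0.40 when each word counts once, so one repeated word is enough to swing the verdict.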

> (Incidentally, why so high?  In my db it's  only 0.3930784.)  --pg

I expect it's because this tokenizer *only* split on whitespace.
Punctuation was left intact.  So, e.g., on the Python discussion list stuff
like

    The new approach blows it out of the water:
and
    This is very deep water;
and
    Then you'll take to Python like a duck takes to water!

are counted as "water:" and "water;" and "water!", not as "water".
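
A minimal sketch of that effect, using the three lines above. The punctuation-stripping variant is just one possible alternative for comparison (an assumption, not the tokenizer used in these tests):

```python
import string

lines = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]


def split_on_whitespace(text):
    """Split on whitespace only; punctuation stays glued to the word."""
    return text.split()


def split_and_strip_punctuation(text):
    """Hypothetical alternative: also strip leading/trailing punctuation."""
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in stripped if tok]


raw = [tok for line in lines for tok in split_on_whitespace(line)]
clean = [tok for line in lines for tok in split_and_strip_punctuation(line)]

# Whitespace-only splitting never produces a bare "water" here.
print([tok for tok in raw if tok.startswith("water")])
print(raw.count("water"), clean.count("water"))
```

Under whitespace-only splitting, none of the three occurrences adds to the count for plain "water", so the legitimate mail contributes little evidence against the word being spammy.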

The spam corpus is chock full o' "water", though:

+ Porn sites advertising water sports.
+ Assorted bottled water pitches.
+ Assorted "oxygenated water" pitches.
+ Claims of environmental friendliness explicated via stuff like
  "no harmful chlorine to pollute the water or air!".
+ Pitches for weight-loss gimmicks emphasizing that you'll really
  lose fat, not just reduce water retention.
+ Pitches for weight-loss gimmicks emphasizing that you'll reduce
  water retention as well as lose fat.
+ One repeated bizarre analogy for how a breast enlargement cream
  works in the way "a sponge absorbs water".
+ This revolutionary new flat garden hose will really cut your water
  bills.
+ Ditto this miracle new laundry tablet that lets you use a fraction of
  the water needed by old-fashioned detergents.
+ Survivalist pitches often mention water in the same sentence as
  air and medical care.

I got tired then <wink>.