[spambayes-dev] Tricky false positive: US states

Tim Peters tim.one at comcast.net
Tue Oct 7 22:03:41 EDT 2003


[Skip Montanaro]
> I was thinking more along the lines of deleting "chicago" or
> "illinois" (but not necessarily both) if they were strongly
> correlated.  Furthermore, I was thinking that the correlation would
> be done at training time, not scoring time.  That would probably
> wreak havoc with incremental training (though maybe you could keep a
> separate correlation database to assist there).  I assumed the core
> classifier would remain the same, just that the tokens in the
> training database would (hopefully) be more independent predictors.

It's not clear that increasing token independence would help the spambayes
algorithm.  Gary Robinson and I have both often noted here that correlation
appears to help us more than it hurts us.  That hasn't been put to a formal
test, of course (apart from the early days leaving HTML tags intact, which
is an extreme case), so it's quite subject to refutation.  OTOH, the cases
where correlation does hurt us remain so noteworthy that people still post
an example here on the rare occasion they bump into one!

Paul Wagland posted a strong argument for believing that Richie's "state
name" correlations wouldn't have been learned from training data, and if
that's so, developing a correlation measure wouldn't help his example.

Another possibility is to partition words into equivalence classes based on
human knowledge (such as we applied to Richie's example), picking an
arbitrary member of each class as its (fixed) canonical representative, and
replacing each word with its class's representative.  Then, e.g., each of
Alabama, Alaska, etc. might be replaced with "Minnesota", and only "Minnesota"
would appear in the database.  Since we already weed out duplicates in the
training and scoring algorithms, a message containing any number of state
names would contribute exactly one instance of "Minnesota" to the
calculations.
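A minimal sketch of that idea, assuming a hypothetical hand-built class table (a real one would enumerate all the state names from human knowledge):

```python
# Hypothetical equivalence class of US state names; "minnesota" is the
# arbitrary fixed canonical representative.
STATE_CLASS = {"alabama", "alaska", "arizona", "illinois", "minnesota"}
CANONICAL = "minnesota"

def canonicalize(tokens):
    """Replace every member of the class with its canonical representative.

    Since training and scoring already weed out duplicates, a message
    mentioning any number of state names then contributes exactly one
    "minnesota" to the calculations.
    """
    return [CANONICAL if t in STATE_CLASS else t for t in tokens]

print(canonicalize(["moving", "from", "alaska", "to", "illinois"]))
# ['moving', 'from', 'minnesota', 'to', 'minnesota']
```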

There are dangers there too, in part because so many English words have so
many distinct context-dependent meanings.  That makes it safer to learn
correlations, and confidence in them, from the training data.  Learning also
makes correlations unique to each classifier, and so at best very hard for a
spammer to systematically outwit.

But learning correlations is an O(N**2) problem (because there are O(N**2)
distinct token pairs, where N is the number of distinct tokens), and the way
we define tokens can leave N very large.  N**2 is then unthinkably large.
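The arithmetic is brutal even for a modest vocabulary (the N here is an illustrative figure, not from any real spambayes database):

```python
# Number of distinct unordered token pairs for an assumed vocabulary size.
N = 200_000                  # distinct tokens in a largish database
pairs = N * (N - 1) // 2     # distinct unordered token pairs
print(pairs)                 # 19_999_900_000 -- roughly 2e10 pairs to track
```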

Correlation analysis might be restricted to extreme-spamprob words (meaning
spamprob close to 0 or to 1) that have been seen "often".  That could make it
computationally tractable for casual use.  An example might be the mountain
of distinct "it came from python.org via Mailman" tokens, which are all
strong ham clues in many of our databases.  I believe that's the single case
where correlation routinely harms my classifiers now, driving the rare spam
that sneaks thru python.org via Mailman into my Unsure folder.  OTOH, that
same hammy correlation probably also saves some ham from getting pushed up
to Unsure -- we seemed to run out of pure wins a long time ago.
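A sketch of that restriction, under assumed thresholds and an assumed token -> (spamprob, times_seen) layout (not spambayes' actual internals):

```python
def correlation_candidates(wordinfo, min_count=10, extreme=0.1):
    """Yield tokens worth pairwise analysis: spamprob near 0 or 1,
    seen at least min_count times.  Cuts N down before the O(N**2) step.
    """
    for token, (prob, seen) in wordinfo.items():
        if seen >= min_count and (prob <= extreme or prob >= 1 - extreme):
            yield token

# Illustrative entries only:
wordinfo = {
    "received:python.org": (0.01, 500),  # strong ham clue, seen often
    "viagra": (0.99, 120),               # strong spam clue, seen often
    "meeting": (0.45, 300),              # not extreme -- excluded
    "rareword": (0.02, 3),               # too rarely seen -- excluded
}
print(sorted(correlation_candidates(wordinfo)))
# ['received:python.org', 'viagra']
```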