[spambayes-dev] Tricky false positive: US states

Mon Oct 6 23:01:16 EDT 2003

[Skip Montanaro]
> ...
> This might be worth investigating.  Can't we compute the correlation
> between two tokens by keeping track of how frequently they appear in
> the same message?  If we know "chicago" and "illinois" are very
> strongly correlated, we can potentially choose to ignore one or the
> other.  This could reduce the size of the database substantially, and
> also work toward a situation where we believed more strongly -- with
> some justification -- that our consultants' recommendations were
> accurate; that a politician wasn't paying them off behind the scenes,
> figuratively speaking.
>
> It would appear that this is an O(n*n) problem, since to accurately
> decide correlation between any two tokens we have to consider how
> each token correlates with all others.  The problem size can probably
> be simplified in various ways to avoid performing a full comparison.

We discussed this briefly in the early days.  I don't know an efficient way
to do this, in either time or space, where "efficient" == linear in the
number of tokens.  Even mixing unigrams with bigrams is still linear-time and
linear-space, and that little extension boosts time and space requirements
dramatically enough by itself.
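
To make the cost concrete, here is a minimal sketch of the pairwise bookkeeping
Skip describes.  It is not anything in the spambayes code: cooccurrence_counts,
jaccard, and the sample messages are hypothetical names and data, and the
Jaccard ratio is just one simple stand-in for "correlation".  The point is that
a message with k distinct tokens touches k*(k-1)/2 pair counters, so the pair
table grows much faster than the token table.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(messages):
    """Count how many messages contain each token and each unordered token pair."""
    token_counts = Counter()
    pair_counts = Counter()
    for tokens in messages:
        unique = sorted(set(tokens))  # count each token once per message
        token_counts.update(unique)
        # k distinct tokens yield k*(k-1)/2 pairs: this is the quadratic part.
        pair_counts.update(combinations(unique, 2))
    return token_counts, pair_counts


def jaccard(a, b, token_counts, pair_counts):
    """Messages containing both tokens divided by messages containing either."""
    key = (a, b) if a < b else (b, a)  # pairs are stored in sorted order
    both = pair_counts[key]
    either = token_counts[a] + token_counts[b] - both
    return both / either if either else 0.0


# Hypothetical toy corpus, one token list per message.
messages = [
    ["chicago", "illinois", "mortgage"],
    ["chicago", "illinois", "weather"],
    ["illinois", "tax"],
    ["python", "tokens"],
]
token_counts, pair_counts = cooccurrence_counts(messages)
print(jaccard("chicago", "illinois", token_counts, pair_counts))
print(len(token_counts), "tokens,", len(pair_counts), "pairs")
```

On real mail, with hundreds of distinct tokens per message, the pair table is
what blows up, which is why any practical version would have to prune
aggressively (for example, only tracking pairs among the strongest clues).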

There are other kinds of classification algorithms that don't assume
independence of evidence sources.

In this algorithm, I think token correlation helps us far more often than it
hurts us (e.g., "spambayes" and "tokens" are probably both strongly hammy in
your db, and treating them as independent helps nail *this* msg as ham for
you, even though it also contains penis and human growth hormone <wink>).
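
A rough sketch of that effect, under loud assumptions: the combiner below is a
plain naive-Bayes-style product rule, used only for illustration and not the
combining scheme the project actually uses, and the per-token spam
probabilities are made-up numbers, not values from anyone's database.

```python
from math import prod


def combine_independent(probs):
    """Combine per-token spam probabilities as if the tokens were independent."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical spam probabilities for two strongly spammy tokens.
spammy = [0.99, 0.95]
# Hypothetical spam probabilities for two correlated, strongly hammy tokens.
hammy = [0.01, 0.01]

# Keep only one of the correlated hammy tokens (as if the other were ignored).
one_hammy = combine_independent(spammy + hammy[:1])
# Count both hammy tokens as independent evidence.
both_hammy = combine_independent(spammy + hammy)

print(f"one hammy token counted:  {one_hammy:.2f}")
print(f"both hammy tokens counted: {both_hammy:.2f}")
```

With these made-up numbers, dropping one of the correlated hammy tokens leaves
the score up in spam territory, while counting both pulls it well down toward
ham.  That is the "helps us" case: the double counting is technically
unjustified, but it pushes in the right direction.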