[Spambayes] Tokens

Skip Montanaro skip at pobox.com
Sat Aug 16 10:01:23 EDT 2003


 
    Steve> Can someone please explain message tokens?  Technical answer is
    Steve> fine.

Tokens are roughly words, so if I tokenized "Mary had a little lamb", the
tokens would be

    ["Mary", "had", "a", "little", "lamb"]
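
As a rough illustration of that word-level step, here is a minimal sketch
that splits on whitespace.  The name simple_tokenize is made up for this
example; the real SpamBayes tokenizer does considerably more than this.

```python
def simple_tokenize(text):
    """Split text on runs of whitespace.

    A stand-in for illustration only, not the SpamBayes tokenizer.
    """
    return text.split()


tokens = simple_tokenize("Mary had a little lamb")
print(tokens)
```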

However, messages also carry structure and context: where the message came
from, how many people it was sent to, how large it is, whether it contains
non-ASCII text, what URLs are embedded in it, and so on.  Some of this
"meta" information is useful when trying to distinguish spam from ham.
The SpamBayes tokenizer captures this information as synthetic tokens.  My
token database has this entry:

    >>> db["url:python"]
    (143, 5248)

The pair is (spam count, ham count), so this means that URLs containing
"python" appeared 5248 times in ham and only 143 times in spam.  The
synthetic token "url:python" is thus a fairly strong indicator that a
message is ham.
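
To make that concrete, here is a small sketch of how tokens like
"url:python" might be produced, and how lopsided those two counts are.
Everything in it is an assumption for illustration: url_tokens and
spam_fraction are hypothetical names, the URL regex is deliberately crude,
and the raw ratio is not the score SpamBayes actually computes (it ignores,
among other things, how many hams and spams were trained on).

```python
import re
from urllib.parse import urlparse

# Crude pattern for http/https URLs; an assumption made for this sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def url_tokens(text):
    """Yield hypothetical 'url:<piece>' tokens for each URL host in text."""
    for url in URL_RE.findall(text):
        host = urlparse(url).netloc.lower()
        for piece in host.split("."):
            if piece:
                yield "url:" + piece


def spam_fraction(spam_count, ham_count):
    """Return the share of sightings that were in spam (naive ratio)."""
    total = spam_count + ham_count
    if total == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_count / total


print(list(url_tokens("See http://www.python.org/doc/ for details")))
print(spam_fraction(143, 5248))  # roughly 0.027
```

With the counts above, fewer than 3 in 100 sightings of "url:python" were
in spam, which is why the token leans so heavily toward ham.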

Skip



