[spambayes-dev] Interesting unsure

Tim Peters tim.one at comcast.net
Thu Jun 26 00:49:31 EDT 2003


[Skip]
> ...
> the subject had umlauts over many of the vowels:
>
>     ousp Wänt to makë löve likë a teën?
>
> so of course, I got several tokens which the classifier ignored.  The
> debug and classification headers were

For body (but not header) tokenization, the option replace_nonascii_chars
(off by default) is very effective against junk like this, at least for
those whose ham is mostly 7-bit ASCII.  That option replaces each "funny
character" with a question mark.  So, e.g., any oddball spelling for "o" in
"love" turns the token into "l?ve"; the occasional Euro-name in ham isn't
really hurt by this at all.  I expect it would also be effective if applied
to headers.  OTOH, I don't recall getting any Unsures where this would have
tipped the score into my Spam range.
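
To make the replacement concrete, here is a minimal sketch of the idea. It is not the SpamBayes tokenizer's actual code: the helper names replace_nonascii and simple_tokens are hypothetical, and splitting on whitespace merely stands in for real tokenization.

```python
def replace_nonascii(text: str) -> str:
    """Replace every character outside 7-bit ASCII with a question mark.

    Hypothetical helper illustrating what the replace_nonascii_chars
    option is described as doing; not taken from the SpamBayes source.
    """
    return "".join(ch if ord(ch) < 128 else "?" for ch in text)


def simple_tokens(text: str) -> list[str]:
    """Assumed stand-in for tokenization: lowercase, split on whitespace."""
    return replace_nonascii(text).lower().split()


if __name__ == "__main__":
    subject = "Wänt to makë löve likë a teën?"
    print(replace_nonascii(subject))
    print(simple_tokens(subject))
```

Every accented spelling of "love" collapses to the single token "l?ve", so the classifier sees one frequently repeated clue instead of many rare variants it has never trained on.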

Indeed, my Unsures this week are utterly dominated by trash bouncing back to
various webmaster and admin addresses due to the Sobig worm forging sender
addresses, like

"""
From the U.S. Courts Hostmaster:
Content violation found in email message.

From: webmaster at python.org
To: pacer at psc.uscourts.gov
File(s): details.pif

Matching filename: *.pif
"""

It occurs to me that I haven't had "a spam problem" since last year -- now
I've got "a virus bounce problem" <0.5 wink>!




