[Spambayes] Foreign language spam: bug or feature?
Tim Peters
tim.one@comcast.net
Fri Oct 25 15:56:44 2002
[Tim, remarks about an Outlook client "bug" that caused Asian spam to
get nailed via replacing most high-bit chars with question marks, leading
to clue lists like this one:]
> Spam Score: 1
>
> '*H*' 0
> '*S*' 1
> 'header:Return-Path:1' 0.611133
> 'header:Message-ID:1' 0.813889
> '15????' 0.844828
> '24????' 0.844828
> '7??????' 0.844828
> '&' 0.863317
> 'header:Mime-Version:1' 0.89556
> 'header:Reply-To:1' 0.90756
> '10????' 0.934783
> '??????!!!' 0.934783
> 'header:Received:2' 0.957828
> '??????????)' 0.958716
> '??????...' 0.965116
> '????????...' 0.965116
> 'message-id:@cpimssmtpa05.msn.com' 0.969799
> 'from:email addr:korea.com>' 0.980349
> '(????' 0.981928
> '??.' 0.985437
> 'e-mail??????' 0.986322
> '????,' 0.99505
> '????????,' 0.995258
> '??????,' 0.99545
> '????????.' 0.997691
> '??????????.' 0.99776
> 'skip:? 20' 0.998034
> '????????????' 0.998192
> '??????????' 0.998474
> '??????' 0.998562
> '????' 0.998598
> '????????' 0.998672
> 'skip:? 10' 0.998894
MarkH subsequently fixed that bug by accident <wink>, while greatly speeding
the Outlook operations and making the Outlook client more robust. My Asian
spam is *still* nailed, but via clue lists like this now:
'skip:\x92 40' 0.958716
'skip:\x95 40' 0.958716
'skip:\x96 30' 0.958716
'skip:\x93 30' 0.965116
'skip:\x93 50' 0.965116
'8bit%:58' 0.969799
'skip:\x82 10' 0.969799
'skip:\x83 30' 0.969799
'skip:\x8d 30' 0.969799
'skip:\x93 20' 0.969799
'subject:==?=' 0.969799
'skip:\x81 60' 0.973373
'skip:\x93 10' 0.973373
'url:jp' 0.973373
'skip:\x81 10' 0.97619
'skip:\x81 40' 0.97619
'skip:\x82 30' 0.97619
'subject:GyRCTCQ' 0.97619
'subject:iso' 0.978469
'8bit%:69' 0.980349
'skip:\x81 30' 0.980349
'skip:\x81 20' 0.981928
'8bit%:97' 0.983271
'8bit%:72' 0.988432
'8bit%:83' 0.990405
'8bit%:87' 0.990798
'8bit%:91' 0.991159
'8bit%:81' 0.99236
'8bit%:56' 0.993274
'8bit%:88' 0.994148
'8bit%:68' 0.9947
'8bit%:85' 0.9947
'8bit%:94' 0.994822
'8bit%:50' 0.994938
'8bit%:80' 0.995258
'8bit%:75' 0.99545
'subject:=?' 0.996151
'8bit%:86' 0.996562
'8bit%:93' 0.99776
'8bit%:100' 0.998375
The downside for me is that the database size took a significant hit, just
because there are a lot more potential "skip" tokens than strings of
question marks. WRT correlation effects, a msg that has an 8bit% metatoken
under this scheme is likely to have lots of them, but is also likely to have
lots of distinct '?'*n tokens under the other scheme; in both cases,
counting them all as distinct clues actually helps nail this stuff as spam.
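To make the database-size point concrete, here is a rough Python sketch of how the two metatokens might be generated for one long word. The function name long_word_tokens is hypothetical, and the rules (first character plus length rounded down to a multiple of 10 for "skip", rounded percentage of high-bit characters for "8bit%") are assumptions inferred from the token shapes in the clue lists above, not a copy of the tokenizer code.

```python
def long_word_tokens(word):
    """Return metatokens for one overly long word (hypothetical helper).

    Assumed rules, inferred from the clue lists:
      skip:<first char> <length rounded down to a multiple of 10>
      8bit%:<percentage of chars with ord >= 128>, only if there are any
    """
    n = len(word)
    tokens = ["skip:%s %d" % (word[0], n // 10 * 10)]
    hicount = sum(1 for c in word if ord(c) >= 128)
    if hicount:
        tokens.append("8bit%%:%d" % round(hicount * 100.0 / n))
    return tokens


def distinct_skip_tokens(words):
    """Count the distinct skip tokens produced by a collection of words."""
    return len({long_word_tokens(w)[0] for w in words})


# 31 possible high-bit lead characters, each at 3 different lengths ...
raw_words = [chr(c) * n for c in range(0x81, 0xA0) for n in (15, 25, 35)]
# ... versus the same words after every high-bit char became '?'.
replaced_words = ["?" * len(w) for w in raw_words]
```

Under these assumptions distinct_skip_tokens(raw_words) is 93 while distinct_skip_tokens(replaced_words) is 3: without replacement, every distinct lead character gets its own family of skip tokens, which is where the extra database entries come from.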
Unless someone has a strong objection, I expect to introduce a new option:
"""
[Tokenizer]
# If true, replace high-bit characters (ord(c) >= 128) and
# control characters with question marks. This allows
# non-ASCII character strings to be identified with little
# training and small database burden. It's appropriate only
# if your ham is plain 7-bit ASCII, or nearly so, so that
# the mere presence of non-ASCII character strings is known
# in advance to be a strong spam indicator.
replace_nonascii_chars: False
"""