[Spambayes] Foreign language spam: bug or feature?

Fri Oct 25 15:56:44 2002

[Tim, remarks about an Outlook client "bug" that caused Asian spam to
 get nailed by replacing most high-bit chars with question marks, leading
 to clue lists like this one:]
> Spam Score: 1
>
> '*H*'                          0
> '*S*'                          1
> 'header:Return-Path:1'         0.611133
> 'header:Message-ID:1'          0.813889
> '15????'                       0.844828
> '24????'                       0.844828
> '7??????'                      0.844828
> '&amp;'                        0.863317
> 'header:Mime-Version:1'        0.89556
> 'header:Reply-To:1'            0.90756
> '10????'                       0.934783
> '??????!!!'                    0.934783
> 'header:Received:2'            0.957828
> '??????????)'                  0.958716
> '??????...'                    0.965116
> '????????...'                  0.965116
> 'message-id:@cpimssmtpa05.msn.com' 0.969799
> 'from:email addr:korea.com>'   0.980349
> '(????'                        0.981928
> '??.'                          0.985437
> 'e-mail??????'                 0.986322
> '????,'                        0.99505
> '????????,'                    0.995258
> '??????,'                      0.99545
> '????????.'                    0.997691
> '??????????.'                  0.99776
> 'skip:? 20'                    0.998034
> '????????????'                 0.998192
> '??????????'                   0.998474
> '??????'                       0.998562
> '????'                         0.998598
> '????????'                     0.998672
> 'skip:? 10'                    0.998894

MarkH subsequently fixed that bug by accident <wink>, while greatly speeding
the Outlook operations and making the Outlook client more robust.  My Asian
spam is *still* nailed, but via clue lists like this now:

'skip:\x92 40'                 0.958716
'skip:\x95 40'                 0.958716
'skip:\x96 30'                 0.958716
'skip:\x93 30'                 0.965116
'skip:\x93 50'                 0.965116
'8bit%:58'                     0.969799
'skip:\x82 10'                 0.969799
'skip:\x83 30'                 0.969799
'skip:\x8d 30'                 0.969799
'skip:\x93 20'                 0.969799
'subject:==?='                 0.969799
'skip:\x81 60'                 0.973373
'skip:\x93 10'                 0.973373
'url:jp'                       0.973373
'skip:\x81 10'                 0.97619
'skip:\x81 40'                 0.97619
'skip:\x82 30'                 0.97619
'subject:GyRCTCQ'              0.97619
'subject:iso'                  0.978469
'8bit%:69'                     0.980349
'skip:\x81 30'                 0.980349
'skip:\x81 20'                 0.981928
'8bit%:97'                     0.983271
'8bit%:72'                     0.988432
'8bit%:83'                     0.990405
'8bit%:87'                     0.990798
'8bit%:91'                     0.991159
'8bit%:81'                     0.99236
'8bit%:56'                     0.993274
'8bit%:88'                     0.994148
'8bit%:68'                     0.9947
'8bit%:85'                     0.9947
'8bit%:94'                     0.994822
'8bit%:50'                     0.994938
'8bit%:80'                     0.995258
'8bit%:75'                     0.99545
'subject:=?'                   0.996151
'8bit%:86'                     0.996562
'8bit%:93'                     0.99776
'8bit%:100'                    0.998375

The downside for me is that the database size took a significant hit, just
because there are a lot more potential "skip" tokens than strings of
question marks.  WRT correlation effects, a msg that produces one 8bit%
metatoken under this scheme is likely to produce lots of them, just as it's
likely to produce lots of distinct '?'*n tokens under the other scheme; in
both cases, counting them all as distinct clues actually helps nail this
stuff as spam.
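
To make the database-size point concrete, here's a rough sketch of how
long words could turn into "skip" and "8bit%" metatokens.  The helper name
long_word_tokens and the 12-character cutoff are assumptions made for
illustration, not a copy of the tokenizer's code.

```python
def long_word_tokens(word, maxword=12):
    """Yield summary tokens for a word longer than maxword.

    Hypothetical helper: the name and the default cutoff are assumptions.
    """
    n = len(word)
    if n <= maxword:
        return
    # First character plus the length rounded down to a multiple of 10.
    yield "skip:%s %d" % (word[0], n // 10 * 10)
    # Percentage of characters with the high bit set.
    hicount = sum(1 for c in word if ord(c) >= 128)
    if hicount:
        yield "8bit%%:%d" % round(hicount * 100.0 / n)


# Three made-up 15-character "words" that differ only in their lead byte.
raw_words = ["\x81" * 15, "\x93" * 15, "\x82" * 15]

# Left alone, each distinct lead byte gets its own skip token.
tokens_raw = {t for w in raw_words for t in long_word_tokens(w)}

# With every high-bit char replaced by '?', they all collapse into one.
tokens_masked = {t for w in raw_words for t in long_word_tokens("?" * len(w))}
```

Here tokens_raw holds three skip tokens plus one 8bit% token, while
tokens_masked holds only 'skip:? 10'.  Multiply that by every possible
lead byte and length bucket and the extra database burden adds up.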

Unless someone has a strong objection, I expect to introduce a new option:

"""
[Tokenizer]
# If true, replace high-bit characters (ord(c) >= 128) and
# control characters with question marks.  This allows
# non-ASCII character strings to be identified with little
# training and small database burden.  It's appropriate only
# if your ham is plain 7-bit ASCII, or nearly so, so that
# the mere presence of non-ASCII character strings is known
# in advance to be a strong spam indicator.
replace_nonascii_chars: False
"""