[Mailman-Users] Chinese characters spam filter?

Fri Jul 8 21:04:52 EDT 2016

Hi,

On 07/07/16 04:41, Mark Sapiro wrote:
> That should be
> 
> ^Subject:.*[list of all Chinese characters here]
> 
> except that if your list's preferred language is English and you haven't
> changed Mailman's character set for English from ASCII to UTF-8, the
> text you are matching against won't contain any Chinese characters
> because the decoded headers are converted to the character set of the
> list's preferred language and all the Chinese characters will be
> converted to '?'.
> 
> You might try something like
> 
> ^Subject:.*\?{4,}
> 
> This will match any subject that contains 4 or more non-ascii characters
> in a row. Unfortunately, it will also match
> 
> Subject: WTF is happening here????
> 
> but you could try some number other than 4 but greater than 1
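
For reference, the behaviour Mark describes can be reproduced outside Mailman with a small sketch. It assumes Python 3 and plain ASCII as the list's character set; the sample subject and the variable names are made up for illustration and are not taken from Mailman.

```python
import re

# Hypothetical decoded Subject header containing four Chinese characters.
subject = "Subject: 有限公司 special offer"

# With the 'replace' error handler, every character that ASCII cannot
# represent is turned into a single '?'.
ascii_subject = subject.encode("ascii", "replace").decode("ascii")

# Mark's suggested pattern: four or more '?' in a row in the Subject line.
pattern = re.compile(r"^Subject:.*\?{4,}", re.MULTILINE)

matches_spam = pattern.search(ascii_subject) is not None
# The false positive mentioned above: a subject that really ends in '????'.
matches_innocent = pattern.search("Subject: WTF is happening here????") is not None

print(ascii_subject)
print(matches_spam, matches_innocent)
```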

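How about using 'backslashreplace' instead of 'replace' as the error
handler when encoding to the character set of the list's preferred
language in Mailman/Handlers/SpamDetect.py? A character that can't be
encoded is then written as an escape such as \u6709 instead of '?'.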

Then the desired pattern in this case seems to be

^Subject:.*(\\u[0-9a-f]{4}){4}

It also matches strings like 
'What does the string "\\u6709\\u9650\\u516c\\u53f8" mean?', though.
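
A minimal sketch of how this would behave, assuming Python 3, ASCII as the list's character set, and the pattern anchored with ^ and a colon after Subject as in Mark's patterns. The sample subjects and variable names are illustrative only and are not part of Mailman.

```python
import re

# Hypothetical decoded Subject header containing four Chinese characters.
subject = "Subject: 有限公司 special offer"

# With 'backslashreplace', each unencodable character becomes an escape
# sequence (backslash, 'u', four hex digits for these characters).
escaped = subject.encode("ascii", "backslashreplace").decode("ascii")

# Four or more consecutive \uXXXX escapes in the Subject line.
pattern = re.compile(r"^Subject:.*(\\u[0-9a-f]{4}){4}", re.MULTILINE)

matches_chinese = pattern.search(escaped) is not None

# A run of real question marks no longer triggers the rule.
matches_question_marks = (
    pattern.search("Subject: WTF is happening here????") is not None
)

# A subject that literally contains the escape text still matches.
literal = 'Subject: What does the string "\\u6709\\u9650\\u516c\\u53f8" mean?'
matches_literal = pattern.search(literal) is not None

print(escaped)
print(matches_chinese, matches_question_marks, matches_literal)
```

In Python 3, 'backslashreplace' writes characters below U+0100 as \xNN and characters above U+FFFF as \UNNNNNNNN, so this pattern only counts characters that are escaped in the four-digit \u form.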

=== modified file 'Mailman/Handlers/SpamDetect.py'

--- Mailman/Handlers/SpamDetect.py      2016-01-18 23:56:58 +0000
+++ Mailman/Handlers/SpamDetect.py      2016-07-09 00:47:33 +0000
@@ -86,7 +86,7 @@
                 # unicode it as iso-8859-1 which may result in a garbled
                 # mess, but we have to do something.
                 uvalue += unicode(frag, 'iso-8859-1', 'replace')
-        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'replace'))
+        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'backslashreplace'))
     return headers

-- 
Yasuhito FUTATSUKI <futatuki at poem.co.jp>