[Mailman-Users] Chinese characters spam filter?
Yasuhito FUTATSUKI
futatuki at poem.co.jp
Fri Jul 8 21:04:52 EDT 2016
Hi,
On 07/07/16 04:41, Mark Sapiro wrote:
> That should be
>
> ^Subject:.*[list of all Chinese characters here]
>
> except that if your list's preferred language is English and you haven't
> changed Mailman's character set for English from ASCII to UTF-8, the
> text you are matching against won't contain any Chinese characters
> because the decoded headers are converted to the character set of the
> list's preferred language and all the Chinese characters will be
> converted to '?'.
>
> You might try something like
>
> ^Subject:.*\?{4,}
>
> This will match any subject that contains 4 or more non-ascii characters
> in a row. Unfortunately, it will also match
>
> Subject: WTF is happening here????
>
> but you could try some number other than 4 but greater than 1
How about using 'backslashreplace' instead of 'replace' as the error
handler when encoding to the charset of the list's preferred language in
Mailman/Handlers/SpamDetect.py?
Then a suitable pattern for this case would be
^Subject:.*(\\u[0-9a-f]{4}){4}
It would also match a subject that contains such escapes as literal text, like
'What does the string "\u6709\u9650\u516c\u53f8" mean?', though.
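To make the difference concrete, here is a minimal standalone sketch that runs outside Mailman. It assumes the list charset is 'ascii'. The render() helper and the sample subjects are hypothetical, made up for illustration. It only mimics the encode step of the one line changed in the patch, not the rest of SpamDetect.py.

```python
import re

# Pattern from the quoted message ('?' runs) and the one proposed here (escapes).
QUESTION_MARKS = re.compile(r'^Subject:.*\?{4,}')
ESCAPES = re.compile(r'^Subject:.*(\\u[0-9a-f]{4}){4}')


def render(header, errors, cset='ascii'):
    """Encode a decoded header to cset with the given error handler.

    Hypothetical helper: it stands in for uvalue.encode(cset, errors).
    """
    return header.encode(cset, errors).decode(cset)


# Hypothetical sample subjects.
chinese = u'Subject: \u6709\u9650\u516c\u53f8 special offer'
shouting = u'Subject: WTF is happening here????'
quoting = u'Subject: What does the string "\\u6709\\u9650\\u516c\\u53f8" mean?'

for header in (chinese, shouting, quoting):
    old = render(header, 'replace')           # current behaviour
    new = render(header, 'backslashreplace')  # behaviour with the patch
    print(old)
    print('  \\?{4,} matches: %s' % bool(QUESTION_MARKS.search(old)))
    print(new)
    print('  \\u escapes match: %s' % bool(ESCAPES.search(new)))
```

With 'replace', the Chinese subject and the "????" subject both match the question-mark pattern. With 'backslashreplace', only subjects carrying four consecutive \uXXXX escapes match, which still includes the literal-text case mentioned above.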
=== modified file 'Mailman/Handlers/SpamDetect.py'
--- Mailman/Handlers/SpamDetect.py 2016-01-18 23:56:58 +0000
+++ Mailman/Handlers/SpamDetect.py 2016-07-09 00:47:33 +0000
@@ -86,7 +86,7 @@
                 # unicode it as iso-8859-1 which may result in a garbled
                 # mess, but we have to do something.
                 uvalue += unicode(frag, 'iso-8859-1', 'replace')
-        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'replace'))
+        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'backslashreplace'))
     return headers
--
Yasuhito FUTATSUKI <futatuki at poem.co.jp>