[Mailman-Users] Chinese characters spam filter?

Wed Jul 6 15:41:04 EDT 2016

On 7/6/16 10:51 AM, Greg Lindsay wrote:
> 
> I assume the text box that is asking to input a "Spam Filter Regexp"
> will attempt to match all text in the header.

The regexps match against a block of text which consists of all of the
message and sub-part headers, RFC 2047 decoded and separated by newline
characters. Matching is done in multiline mode so that '^' matches at the
beginning of the string or immediately following a newline.
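
As a rough illustration of what multiline mode changes, here is a small
Python sketch. The header block is made up for the example, and the sketch
only uses the standard re module; it is not Mailman's actual code.

```python
import re

# Hypothetical decoded header block: one header per line, joined by newlines.
headers = "From: someone@example.com\nSubject: hello\nX-Subject: not this one"

# With re.MULTILINE, '^' matches at the start of the string and right after
# each newline, so a Subject: header in the middle of the block is found.
multiline_hits = re.findall(r"^Subject:.*", headers, re.MULTILINE)

# Without re.MULTILINE, '^' only matches at the very start of the string.
plain_hit = re.search(r"^Subject:.*", headers)

print(multiline_hits)  # ['Subject: hello']
print(plain_hit)       # None
```

Note that the '^' anchor also keeps "X-Subject:" from matching, since
"Subject:" does not start that line.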

> Since all headers
> include the text "Subject:" and that is the area of the header that I
> want to filter, this is why "^Subject:" is specified.

Correct.

> If I eliminate
> the literal asterisk and just change this to an asterisk, i.e.:
> "^Subject:*" that should take care of the space, right?

Regexps are not globs. An asterisk doesn't mean 0 or more of anything. It
is a repetition operator which means 0 or more of the preceding item.
"^Subject:*" will match the beginning of the string or a newline followed
by 'Subject' followed by 0 or more ':'.

You would want "^Subject:.*" to match Subject: followed by 0 or more of
any character. See
<https://docs.python.org/2/library/re.html#regular-expression-syntax>.
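
The difference is easy to see with a quick test in Python. This is only a
sketch with an invented subject line; the variable names are hypothetical.

```python
import re

line = "Subject: Hello"

# '*' repeats the preceding ':' only, so the match stops after the colon.
glob_style = re.search(r"^Subject:*", line, re.MULTILINE)

# '.*' means 0 or more of any character, so the match runs to end of line.
regex_style = re.search(r"^Subject:.*", line, re.MULTILINE)

# Because ':*' allows zero colons, this unrelated text matches too.
no_colon = re.search(r"^Subject:*", "Subjects of interest", re.MULTILINE)

print(repr(glob_style.group(0)))   # 'Subject:'
print(repr(regex_style.group(0)))  # 'Subject: Hello'
print(repr(no_colon.group(0)))     # 'Subject'
```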

> Sometimes the
> mails come in with mixed Chinese and English characters, so if an
> English character is first in the subject and my filter specifies
> that it must be a space followed by a Chinese character, then the
> filter would fail to catch this...I think what is needed is this:
> 
> ^Subject:*[list of all Chinese characters here]

That should be

^Subject:.*[list of all Chinese characters here]

except that if your list's preferred language is English and you haven't
changed Mailman's character set for English from ASCII to UTF-8, the
text you are matching against won't contain any Chinese characters
because the decoded headers are converted to the character set of the
list's preferred language and all the Chinese characters will be
converted to '?'.
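
The effect can be approximated in Python as below. This is a sketch under
assumptions, not Mailman's code: the subject text is invented, the
conversion is imitated with str.encode(..., "replace"), and the character
class is just the CJK Unified Ideographs range used for illustration.

```python
import re

# Hypothetical decoded subject containing four Chinese characters.
subject = "Subject: \u4f60\u597d\u4e16\u754c sale"

# Imitate conversion to an ASCII list character set: every character
# that cannot be encoded is replaced by '?'.
ascii_view = subject.encode("us-ascii", "replace").decode("us-ascii")

# Illustrative character class (assumption): CJK Unified Ideographs.
cjk = re.compile("^Subject:.*[\u4e00-\u9fff]", re.MULTILINE)

print(ascii_view)                    # Subject: ???? sale
print(bool(cjk.search(subject)))     # True
print(bool(cjk.search(ascii_view)))  # False
```

The pattern finds the Chinese characters in the original text but has
nothing to match once they have become question marks.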

You might try something like

^Subject:.*\?{4,}

This will match any subject that contains 4 or more non-ASCII characters
in a row, since each one has become a '?'. Unfortunately, it will also match

Subject: WTF is happening here????

but you could try some number other than 4, as long as it is greater than 1.
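
Before putting a threshold into the list configuration, it may help to try
it against a few sample subjects. The samples below are invented, and the
sketch only assumes Python's re module in multiline mode.

```python
import re

pattern = re.compile(r"^Subject:.*\?{4,}", re.MULTILINE)

# Hypothetical subjects as they might look after conversion to ASCII.
samples = [
    "Subject: ???? sale",                  # four converted characters
    "Subject: WTF is happening here????",  # false positive
    "Subject: Really???",                  # only three '?' in a row
    "Subject: Is it? Or not? Who? Me?",    # four '?' but not adjacent
]

results = [bool(pattern.search(s)) for s in samples]

for s, hit in zip(samples, results):
    print(hit, s)
```

The first two samples match and the last two do not, which shows both the
intended catch and the false positive mentioned above.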

> I don't understand the use of an equals sign in the regexp. Isn't
> this implied?

I was referring to an RFC 2047 encoded word which you were apparently
trying to match with

^Subject:\?utf-8\?B\?[56]

except the literal RFC2047 encoding would not be '?utf-8?B?...'. It
would be '=?utf-8?B?...'. I.e. the '=' is part of the string you would
be trying to match. See <https://www.rfc-editor.org/rfc/rfc2047.txt>.

However, you can't match RFC2047 encodings with header_filter_rules
because the headers you are matching against have already been RFC2047
decoded.
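
A short Python sketch shows why such a pattern has nothing to match after
decoding. It uses the standard email.header module on an invented subject;
the pattern below includes the '=' and allows optional whitespace after
"Subject:", which is an assumption made for the example. Mailman's own
decoding code is not shown here.

```python
import base64
import re
from email.header import decode_header, make_header

# Build a hypothetical RFC 2047 encoded word for two Chinese characters.
text = "\u4f60\u597d"
b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
raw_header = "Subject: =?utf-8?B?" + b64 + "?="

# Decode the header value the way a mail library would.
raw_value = raw_header.split(": ", 1)[1]
decoded_header = "Subject: " + str(make_header(decode_header(raw_value)))

# A pattern aimed at the encoded form, including the leading '='.
encoded_word = re.compile(r"^Subject:\s*=\?utf-8\?B\?[56]", re.MULTILINE)

print(bool(encoded_word.search(raw_header)))      # True
print(bool(encoded_word.search(decoded_header)))  # False
```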

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan