[Mailman-Users] spam filtering messages containing certain 8 bit characters

Mark Sapiro mark at msapiro.net
Thu Oct 13 09:29:57 CEST 2011


On 10/12/2011 6:58 PM, William Yardley wrote:
> Does Mailman base64 decode the subject before applying a regex, and if
> so, can I use UTF-8 character names in the regex to match various
> types of 8-bit characters?


No. The header_filter_rules regexps are matched against the raw headers.
If a header is RFC 2047 encoded, it is not decoded before matching.
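
To illustrate the difference, here is a minimal Python 3 sketch using only
the standard library. It is not Mailman's filter code. The variable names
and the sample header are made up for the example, and the decode step is
shown only for contrast, since the filter does not perform it.

```python
import base64
import re
from email.header import decode_header, make_header

# Build a raw RFC 2047 Subject value the way a mail client might
# (hypothetical sample, UTF-8 charset with base64 "B" encoding).
text = "电话卡"
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
raw_subject = "=?utf-8?B?" + encoded + "?="

# The regex from the question, looking for the three characters.
pattern = re.compile("\u7535\u8bdd\u5361")

# Matching against the raw header text finds nothing, because the raw
# header contains only ASCII base64 characters.
raw_match = pattern.search(raw_subject)

# Decoding first would expose the characters to the regex.
decoded_subject = str(make_header(decode_header(raw_subject)))
decoded_match = pattern.search(decoded_subject)

print("raw header:    ", raw_subject, "->", bool(raw_match))
print("decoded header:", decoded_subject, "->", bool(decoded_match))
```

The regex only succeeds on the decoded string, which is why a rule written
in terms of the original characters never fires against the raw header.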


> Say, for example, that I want to block messages with "电话卡" somewhere
> in the subject line.
> 
> Obviously, the actual raw Subject header will be more like:
> 
>  Subject: =?GB2312?B?[encoded stuff here]?=
>  Subject: =?utf-8?B?[encoded stuff here]?=
> 
> I tried putting in a regex to hold messages matching:
>  Subject: .*\u7535\u8bdd\u5361
> 
> And that didn't seem to work. As far as I can tell, there is no way to
> find a substring that will always match when the Subject header is
> base64 encoded.


I think this is correct. Every 3 bytes of input are base64 encoded as a
4-character substring. If the characters you are looking for encode to a
multiple of 3 bytes and begin on a 3-byte boundary, they will always
produce the same base64 string. If they don't begin and end on a 3-byte
boundary, the base64 substring will depend on what comes before and/or
after. Thus, I don't think you can reliably match, even if you are only
dealing with a single character set.
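
The alignment effect can be seen with a short Python 3 sketch. It assumes
UTF-8, where the three characters from the example happen to encode to 9
bytes, and it uses arbitrary ASCII letters as the preceding subject text.
The helper name b64 is made up for the example.

```python
import base64

def b64(data):
    """Return the base64 encoding of data as an ASCII string."""
    return base64.b64encode(data).decode("ascii")

# 9 bytes in UTF-8, i.e. a multiple of 3.
target = "电话卡".encode("utf-8")

aligned = b64(target)              # target starts at byte offset 0
shifted_1 = b64(b"A" + target)     # target starts at byte offset 1
shifted_2 = b64(b"AB" + target)    # target starts at byte offset 2
shifted_3 = b64(b"ABC" + target)   # offset 3, back on a 3-byte boundary

# The aligned encoding only reappears when the target again starts on a
# 3-byte boundary; at offsets 1 and 2 the base64 text is different.
for name, value in [("offset 1", shifted_1),
                    ("offset 2", shifted_2),
                    ("offset 3", shifted_3)]:
    print(name, value, aligned in value)
```

So a regex for the aligned base64 string would catch the subject only when
the preceding text happens to be a multiple of 3 bytes long. The other
offsets would each need their own pattern, and a different charset or
quoted-printable encoding would produce entirely different text again.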

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


