[Mailman-Users] Chinese characters spam filter?

Yasuhito FUTATSUKI futatuki at poem.co.jp
Wed Jul 13 00:15:17 EDT 2016


On 07/13/16 03:47, Mark Sapiro wrote:
> On 07/12/2016 12:03 AM, Stephen J. Turnbull wrote:
>> Mark Sapiro writes:
>>   > On 7/8/16 6:04 PM, Yasuhito FUTATSUKI wrote:
>>   > >
>>   > > How about using 'backslashreplace' instead of 'replace' to encode to
>>   > > list's preferred language in Mailman/Handlers/SpamDetect.py ?
>>
>> I see you've already done this, but ...
>>
>> I would consider xmlrefreplace as well.  xmlrefs are something most
>> people (users/moderators) have seen, backslash they're not going to
>> recognize unless they're programmers.
>
>
> I have now switched to xmlcharrefreplace instead of backslashreplace as
> I agree this will be easier to explain and understand. I was uncertain
> about this at first because I didn't know that xmlcharrefreplace
> wouldn't use entity names in some cases, but it appears that it only
> uses numeric references.

I have no strong objection to switching to xmlcharrefreplace, because my
main concern is to distinguish a literal '?' from replaced characters.
Personally, though, I prefer backslashreplace when looking characters up
in a Unicode table: the numeric references produced by xmlcharrefreplace
are decimal, while backslashreplace uses hexadecimal, and most Unicode
tables express code points in hexadecimal, like U+4E8C.
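
To make the difference concrete, here is a small sketch in Python 3
syntax (Mailman 2.1 itself runs on Python 2, so this is only an
illustration, not the code in SpamDetect.py). The helper name
encode_with and the choice of us-ascii as the list charset are my
assumptions for the example.

```python
def encode_with(text, handler, charset="us-ascii"):
    """Encode text to charset using the given codec error handler and
    return the result as a string for display. (Hypothetical helper.)"""
    return text.encode(charset, handler).decode(charset)


# U+4E8C followed by a literal question mark.
sample = "\u4e8c?"

for handler in ("replace", "xmlcharrefreplace", "backslashreplace"):
    print("%-18s %s" % (handler, encode_with(sample, handler)))

# replace            ??
# xmlcharrefreplace  &#20108;?
# backslashreplace   \u4e8c?
```

With 'replace' the two characters become indistinguishable; the other
two handlers keep them apart, and only backslashreplace shows the same
hexadecimal digits as U+4E8C.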

>> At an earlier stage, you could also just do a trial re-encoding with
>> the list preferred codec, set errors = 'strict', catch the Exception,
>> and re-raise as a Hold (or Discard, according to per-list policy).
>> (Then discard the output.)  I would prefer this solution, I think, as
>> creating regexps turns out to be an issue for many list owners.
>>
>> People would have to learn not to use emoji in headers, of course, or
>> suffer moderation delays or even discards.
>
> I think this will have too many undesired effects. Not just emoji, but
> accented latin or CJK characters, etc. in display names would I think be
> real problems, even on English language lists.

I suggest adding a variable in mm_cfg.py to select the error handler:
'replace' (for backward compatibility), 'xmlcharrefreplace', or
'backslashreplace'.
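
A rough sketch of what I mean, again in Python 3 syntax. The variable
name SPAMDETECT_ENCODE_ERRORS and the helper pick_error_handler are
hypothetical (no such setting exists in Mailman as far as I know), and
fake_cfg is only a stand-in for the real mm_cfg module.

```python
import types

# Handlers the site administrator would be allowed to choose from.
ALLOWED_HANDLERS = ("replace", "xmlcharrefreplace", "backslashreplace")


def pick_error_handler(cfg):
    """Return the configured codec error handler, falling back to
    'replace' (the old behaviour) if the hypothetical setting is
    missing or not one of the allowed values."""
    handler = getattr(cfg, "SPAMDETECT_ENCODE_ERRORS", "replace")
    if handler not in ALLOWED_HANDLERS:
        handler = "replace"
    return handler


# Stand-in for mm_cfg with the hypothetical variable set.
fake_cfg = types.SimpleNamespace(SPAMDETECT_ENCODE_ERRORS="backslashreplace")

header = "\u4e8c?"
print(header.encode("us-ascii", pick_error_handler(fake_cfg)).decode("us-ascii"))
```

Sites that do not set the variable would keep the current 'replace'
behaviour, so existing header_filter_rules would match as before.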

I think it would be better to hold the string attributes of mm_cfg and
of the mlist class as Unicode, rather than encoded in the charset of the
site language or of the list's preferred language (though I know that
would be a lot of trouble to do).
-- 
Yasuhito FUTATSUKI <futatuki at poem.co.jp>

