[Mailman-Users] Mailman Spam Filters

Wed Feb 8 14:17:30 EST 2017

On 02/07/17 09:53, Barco van Rhijn wrote:
> I've put together a profanity word list that I'd like to block on our 
> mailing lists. I'm attempting to run this through the spam filters in 
> Mailman.
> Since these phrases are quite toxic to work with I've added a tame 
> example. ;-)
> 
> *I'm able to block words without an issue. But I'm having problems with 
> two things:*
> 
> 1. *I'm unable to match a phrase only.*
>       e.g. "she is mad"
> 
>      So far if I enter something like this it will match both words in 
> the phrase anywhere in a message.
> 
>   *      Hence an innocent user using a phrase like "she is a
>     darling" would also be blocked.
>   *      As would someone mentioning the word "mad" in a non-derogatory
>     way.

[...]

> Does anyone using this feature have advice for me?

Yes.  In brief, this is a hard, and possibly intractable, problem that
cannot be solved with regexes.  Any attempt to do so will be dependent
upon where you want to draw the line between false positives and false
negatives.  Bayesian filtering has a higher success rate, but is still
not fully reliable, and you will still have to choose where you want to
establish the balance between acceptable false positives and false
negatives.  Further, Mailman has no capability to use Bayesian filters.
(However, you could run incoming mail TO your lists through a Bayesian
filter before forwarding it to Mailman, with the same caveats.)
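
To make the tradeoff concrete, here is a minimal sketch in plain Python
(not Mailman configuration; the pattern, the function name is_blocked,
and the sample strings are all my own illustrative assumptions).  It
matches the tame example phrase only as consecutive whole words, which
addresses the "both words anywhere" complaint, but it still shows a
false negative and a false positive:

```python
import re

# Assumed example phrase from the question. \b anchors on word
# boundaries and \s+ tolerates spaces or line wraps between words.
PHRASE = re.compile(r"\bshe\s+is\s+mad\b", re.IGNORECASE)


def is_blocked(text):
    """Return True if the whole phrase appears as consecutive words."""
    return PHRASE.search(text) is not None


samples = [
    "She is mad",                        # the intended target
    "she is a darling",                  # innocent, shares two words
    "I am mad about jazz",               # innocent use of one word
    "she is\nmad",                       # phrase split across a line wrap
    "she is m4d",                        # trivial evasion
    'Please do not post "she is mad"',   # innocent quotation of the phrase
]

for s in samples:
    print(repr(s), "->", "blocked" if is_blocked(s) else "passed")
```

The evasion sample slips through and the innocent quotation gets
blocked.  Tightening or loosening the pattern only moves those errors
around; it does not remove them.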

If there were an easy solution to this problem, *everyone*[1] would
already be using it.

Related anecdote:
One of my CS professors in college was once approached by the California
DMV to write them a piece of software that would automatically screen
vanity license plate applications for obscene or vulgar meanings.
"OK," he said.
"Including letter-number substitutions."
"OK, that's easy."
"Forward and backward."
"OK."
"Including slang."
"Um, OK ..."
"And in all languages."

This was the point at which he told them that they were smoking crack.
And that task, filtering a single "word", is a much simpler problem than
filtering free-form messages.

The only way you are reliably going to keep all profanity off your
mailing lists without false positives is manual moderation.  If you can
devise a strong AI to do the moderation, more power to you.  But you
cannot do it without smart natural-language processing.  You cannot do
it with 100% accuracy using regex.  Period.  Like XHTML parsing,[2] it
is not a problem that can be solved with regex.

[1]  Well, maybe everyone except 4chan.
[2]  https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

-- 
  Phil Stracchino
  Babylon Communications
  phils at caerllewys.net
  phil at co.ordinate.org
  Landline: 603.293.8485