[Mailman-Users] filtering based on message content

Mark Sapiro mark at msapiro.net
Mon Jul 12 03:11:58 CEST 2010


Russell Clemings wrote:
>
>One other question: Is there an easy way to make it fire on parts of words
>as well as whole words? For example, I might want to catch "dig," "digger,"
>"digging," etc. (Not to mention "motherdigger.")



You can do pretty much any matching you want. For example

  \b(mother)?dig(ger|ging)?\b

would match 'motherdig', 'motherdigger', 'motherdigging', 'dig',
'digger' or 'digging', but it wouldn't match 'diggery' because the \b
at the end of the regexp says "there must be a word boundary here".
A word boundary is the beginning or end of the string, or a transition
between the set of letters, digits and underscore and anything else.
By contrast,

  \b(mother)?dig(ger\w*|ging)?\b

would also match 'diggery' and 'diggers'. It gets somewhat tricky. You
could just match 'dig' regardless of what follows or precedes it with
the regexp

  dig

but then you also match 'digest', 'indigent' and so forth. I know that
'dig' isn't actually the word you're targeting, but the same problem
exists with most simple words.
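
To see how the three patterns above behave, here is a minimal sketch
that runs each of them against a few sample words. It uses \w* for "any
further word characters" after 'ger', and the names strict, loose, bare
and hits are only illustrative, not anything Mailman defines.

```python
import re

# strict: only the listed endings; loose: 'ger' plus any further word
# characters; bare: the substring anywhere, with no boundary checks.
strict = re.compile(r'\b(mother)?dig(ger|ging)?\b', re.IGNORECASE)
loose = re.compile(r'\b(mother)?dig(ger\w*|ging)?\b', re.IGNORECASE)
bare = re.compile(r'dig', re.IGNORECASE)

def hits(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return pattern.search(text) is not None

samples = ['dig', 'Motherdigger', 'diggery', 'diggers', 'digest', 'indigent']
for word in samples:
    print(word, hits(strict, word), hits(loose, word), hits(bare, word))
```

Only the bare pattern fires on 'digest' and 'indigent', which is the
false-positive problem described above.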

See <http://docs.python.org/library/re.html#regular-expression-syntax>
or perhaps <http://oreilly.com/catalog/9780596528126/>.

The original expression I gave you

BADWORDS = re.compile(r'(\W|^)word3(\W|$)|(\W|^)word6(\W|$)', re.I)

is a bit more complicated than it needs to be because (\W|^) and (\W|$)
could just as well be \b. Using the 'verbose' mode of regular
expressions that allows you to insert white space for readability, you
could have something like

BADWORDS = re.compile(r"""\bword3\b |
                          \bword6\b |
                          \b(mother)?dig(ger\w*|ging)?\b
                        """, re.IGNORECASE | re.VERBOSE)

Then later you could decide to add \b(mother)?diggingest\b with minimal
editing like

BADWORDS = re.compile(r"""\bword3\b |
                          \bword6\b |
                          \b(mother)?diggingest\b |
                          \b(mother)?dig(ger\w*|ging)?\b
                        """, re.IGNORECASE | re.VERBOSE)

Another way to do this is like

WORDLIST = [r'\bword3\b',
            r'\bword6\b',
            r'\b(mother)?diggingest\b',
            r'\b(mother)?dig(ger\w*|ging)?\b',
           ]
BADWORDS = re.compile('|'.join(WORDLIST), re.IGNORECASE)

This just makes a list of simple regexps and then joins them with '|'
for the compiled re. In this case, re.VERBOSE isn't needed as we
introduce no insignificant white space.
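
For what it's worth, here is a minimal sketch of using the joined list
to pull the offending words out of a piece of text. find_badwords is a
hypothetical helper, not part of Mailman, and the pattern assumes \w*
for "any further word characters".

```python
import re

WORDLIST = [r'\bword3\b',
            r'\bword6\b',
            r'\b(mother)?diggingest\b',
            r'\b(mother)?dig(ger\w*|ging)?\b',
           ]
BADWORDS = re.compile('|'.join(WORDLIST), re.IGNORECASE)

def find_badwords(text):
    """Return every matched word in text, in order of appearance.

    finditer() is used rather than findall() because the pattern contains
    capturing groups, and findall() would return the groups, not the words.
    """
    return [m.group(0) for m in BADWORDS.finditer(text)]

print(find_badwords('Word3 and the motherdiggers went digging in the digest'))
```

Adding a word later is then just a matter of adding one more entry to
WORDLIST.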

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


