Cannot formulate regex

Paul McGuire ptmcg at austin.rr.com
Sun Sep 16 11:36:42 EDT 2007


On Sep 16, 10:18 am, "Dotan Cohen" <dotanco... at gmail.com> wrote:
> I'd like to filter spam from a certain company. Here are examples of
> strings found in their spam:
> Mega Dik
> Mega D1k
> MegaDik
> Mega. Dik
> M eg ad ik
> M E _G_A_D_ IK
> M_E_G. ADI. K
>
> I figured that this regex would match all but the second example, yet
> it matches none:
> |[^a-z]m[^a-z]e[^a-z]g[^a-z]a[^a-z]d[^a-z]i[^a-z]k[^a-z]|i
>
> What would be the regex that matches "megadik" regardless of whatever
> characters are sprinkled throughout?
>
> Thanks in advance.
>
> Dotan

In your regex, every occurrence of "[^a-z]" requires exactly one
character outside the a-z range.  So what you have *should* match
"M*E*G*A*D*I*K" (an unfortunate pr0n sequel to "M*A*S*H"?) when it has
a non-letter on either side, but not any of your examples.  You will
need to add a '*' character after each of your [^a-z]'s, as in:

[^a-z]*m[^a-z]*e[^a-z]*g[^a-z]*a[^a-z]*d[^a-z]*i[^a-z]*k[^a-z]*

to indicate "0 or more" repetitions of [^a-z].
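
Here is a quick way to check both patterns against the examples from
the original post.  This is only a sketch in Python: I am assuming
re.IGNORECASE is a fair stand-in for the trailing "i" flag, and the
names ORIGINAL, FIXED, SAMPLES and matching() are made up for
illustration.

```python
import re

# The pattern as posted: exactly one non-letter around every letter.
ORIGINAL = re.compile(
    r"[^a-z]m[^a-z]e[^a-z]g[^a-z]a[^a-z]d[^a-z]i[^a-z]k[^a-z]",
    re.IGNORECASE,
)

# The same pattern with '*' added: zero or more non-letters each time.
FIXED = re.compile(
    r"[^a-z]*m[^a-z]*e[^a-z]*g[^a-z]*a[^a-z]*d[^a-z]*i[^a-z]*k[^a-z]*",
    re.IGNORECASE,
)

SAMPLES = [
    "Mega Dik",
    "Mega D1k",
    "MegaDik",
    "Mega. Dik",
    "M eg ad ik",
    "M E _G_A_D_ IK",
    "M_E_G. ADI. K",
]


def matching(pattern, samples):
    """Return the samples in which the pattern is found anywhere."""
    return [s for s in samples if pattern.search(s)]


original_hits = matching(ORIGINAL, SAMPLES)  # none of the samples
fixed_hits = matching(FIXED, SAMPLES)        # all except "Mega D1k"

# The posted pattern does match one-separator text, padded here with a
# space on each side because it also wants a non-letter before the "m"
# and after the "k".
padded_hit = ORIGINAL.search(" M*E*G*A*D*I*K ") is not None

print(original_hits)
print(fixed_hits)
print(padded_hit)
```

"Mega D1k" still slips through because the "1" replaces the "i" rather
than sitting between letters; catching that would take something like
[i1] in place of the plain i, which is my own guess at a fix and not
something tested against real spam.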

Also, I would omit the leading and trailing "[^a-z]*"s - they can match
zero characters, so they never change whether the pattern matches, and
I think they will significantly slow down your regex.
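
A sketch of the trimmed version in Python, again assuming
re.IGNORECASE in place of the "i" flag.  The helper spaced_out() is
hypothetical; it just joins the letters of a word with "[^a-z]*" so
the pattern does not have to be typed by hand.  I have not timed
anything here, this only shows that the trimmed pattern finds the same
text and returns a tighter match.

```python
import re


def spaced_out(word):
    """Hypothetical helper: allow any run of non-letters between letters."""
    return re.compile("[^a-z]*".join(re.escape(c) for c in word), re.IGNORECASE)


TRIMMED = spaced_out("megadik")

# The untrimmed pattern, with "[^a-z]*" at both ends, for comparison.
UNTRIMMED = re.compile("[^a-z]*" + TRIMMED.pattern + "[^a-z]*", re.IGNORECASE)

text = "Buy M_E_G. ADI. K now!"

trimmed_match = TRIMMED.search(text)
untrimmed_match = UNTRIMMED.search(text)

# Both find the word; the untrimmed one also swallows the spaces around it.
print(repr(trimmed_match.group()))
print(repr(untrimmed_match.group()))
```

Since re.search() already scans for a match at any position, the
leading and trailing pieces add nothing to whether a message is
flagged.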

-- Paul



