[Tutor] Amazing power of Regular Expressions...

Mon Nov 6 02:08:53 CET 2006

"Michael Sparks" <ms at cerenity.org> wrote

> The most pathological example of regex avoidance I've seen in a while
> is this:
>
> def isPlain(text):
>    plaindict = {'-': True, '.': True, '1': True, '0': True, '3': True,
>      '2': True, '5': True, '4': True, '7': True, '6': True, '9': True,
>      '8': True, 'A': True, 'C': True, 'B': True, 'E': True, 'D': True,
>      'G': True, 'F': True, 'I': True, 'H': True, 'K': True, 'J': True,
>      'M': True, 'L': True, 'O': True, 'N': True, 'Q': True, 'P': True,
>      'S': True, 'R': True, 'U': True, 'T': True, 'W': True, 'V': True,
>      'Y': True, 'X': True, 'Z': True, '_': True, 'a': True, 'c': True,
>      'b': True, 'e': True, 'd': True, 'g': True, 'f': True, 'i': True,
>      'h': True, 'k': True, 'j': True, 'm': True, 'l': True, 'o': True,
>      'n': True, 'q': True, 'p': True, 's': True, 'r': True, 'u': True,
>      't': True, 'w': True, 'v': True, 'y': True, 'x': True, 'z': True}
>
>    for c in text:
>        if plaindict.get(c, False) == False:
>            return False
>    return True
>
> (sadly this is from real code - in defence of the person
> who wrote it, they weren't even *aware* of regexes)
>
> That's equivalent to the regular expression:
>    * ^[0-9A-Za-z_.-]*$

While using a dictionary is probably overkill, so is a regex.
A simple string holding all characters and an 'in' test would probably
be both easier to read and faster. Which kind of illustrates the
point of the thread I think! :-)
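
As a rough sketch of the "string plus 'in' test" idea (the names PLAIN_CHARS
and is_plain are made up for illustration, and this is only one way of
writing it):

```python
import string

# Same character set as the quoted dictionary: letters, digits, '_', '.', '-'
PLAIN_CHARS = string.ascii_letters + string.digits + "_.-"

def is_plain(text):
    """Return True if every character of text is in PLAIN_CHARS."""
    for c in text:
        if c not in PLAIN_CHARS:
            return False
    # An empty string passes, just as it does with the '*' in the regex.
    return True
```

With this, is_plain("report-2006_v1.txt") is True and is_plain("two words")
is False, because of the space.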

> Now, which is clearer? If you learn to read & write regular expressions,
> then the short regular expression is the clearest form. It's also quicker.

Whether it's quicker will depend on several factors, including the
implementation of the regex library as well as the length of the string.
If it's a single char I'd expect the dictionary lookup to be faster than
a regex parse or the string inclusion test... In fact a table lookup is
how the C standard library usually implements functions like toupper()
and tolower() etc, and for speed reasons.
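
The only way to settle the speed question is to measure it. Here is a
minimal sketch using the standard timeit module. The function names and
the compare() helper are my own invention, and the numbers will vary with
the Python version and the input, so no results are shown:

```python
import re
import string
import timeit

CHARS = string.ascii_letters + string.digits + "_.-"
CHAR_DICT = dict.fromkeys(CHARS, True)
# The regex exactly as quoted. Note that in Python '$' also matches just
# before a trailing newline, so "abc\n" passes here but fails the other two.
PATTERN = re.compile(r"^[0-9A-Za-z_.-]*$")

def by_dict(text):
    """Dictionary lookup per character."""
    for c in text:
        if c not in CHAR_DICT:
            return False
    return True

def by_string(text):
    """String inclusion test per character."""
    for c in text:
        if c not in CHARS:
            return False
    return True

def by_regex(text):
    """Single precompiled regex match over the whole string."""
    return PATTERN.match(text) is not None

def compare(text, number=10000):
    """Return {function name: seconds} for `number` calls on text."""
    results = {}
    for func in (by_dict, by_string, by_regex):
        results[func.__name__] = timeit.timeit(lambda: func(text), number=number)
    return results

if __name__ == "__main__":
    # One single-character input and one long input, per the point above.
    for sample in ("x", "x" * 1000):
        print(len(sample), compare(sample))
```

Running it on both a one-character string and a long one shows how much
the answer depends on the length of the input.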

> to say "don't use them if there's an alternative" is a little strong.
> Aside from the argument that "you now have two problems"
> (which always applies if you think all problems can be hit with
> the same hammer), solving *everything* with regex is often slower.

A regex can be faster than a sequential string search. It depends on
the problem.

The thing that we are all trying to say here (I think) is that regexes
are powerful tools but dangerously complex. It's nearly always
safer and easier to use alternatives where they exist, but when
used intelligently they can solve difficult problems very elegantly.

> JWZ's quote is more aimed at people who think about solving
> every problem with regexes (and where you end up with 10 line
> monstrosities in perl with 5 levels of backtracking).

Agreed, and that's what the message of the thread is about.
Use them when they are the right solution, but look for
alternatives first.

> Also, it's worth bearing in mind that there's more than one definition
> of what regex's are

To be picky, there is only one definition of what regexes are,
but there are many grammars or dialects.

> If your reaction to seeing a problem is "this looks like it can be solved
> using a regex", you should think to yourself: has someone else already
> hit this problem and have they come up with a specialised pattern matcher
> for it already? If not, why not?

Absolutely agree with this.
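
One illustration of a specialised matcher that already exists: pulling the
host and path out of a URL with the standard library's urlsplit() rather
than a hand-written regex. The helper name host_and_path is hypothetical
and the URL is just an example:

```python
from urllib.parse import urlsplit

def host_and_path(url):
    """Return (hostname, path) using the standard library's URL parser."""
    parts = urlsplit(url)
    # hostname is lower-cased and has any port number stripped off
    return parts.hostname, parts.path

print(host_and_path("http://www.example.com:8080/a/b?x=1"))
# prints ('www.example.com', '/a/b')
```

The parser already copes with ports, query strings and fragments, which
is exactly the sort of detail a quick regex tends to miss.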

> :-)

Likewise :-)

Alan g.