Repeating assertions in regular expression

candide candide at free.invalid
Tue Jan 3 03:46:32 EST 2012


The regular expression HOWTO 
(http://docs.python.org/howto/regex.html#more-metacharacters) states 
the following:

# ------------------------------
zero-width assertions should never be repeated, because if they match 
once at a given location, they can obviously be matched an infinite 
number of times.
# ------------------------------
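To make the quoted statement concrete, here is a minimal sketch, assuming Python 3 (the session in this post is Python 2.7). It shows that \b matches at a position without consuming any characters, so writing it twice at the same place cannot change the result. The sample string and the variable names are mine, chosen for illustration only.

```python
import re

text = "hello world"

# \b succeeds at position 0 but consumes nothing: the match span is empty.
m = re.search(r"\b", text)
boundary_span = m.span()

# Writing the assertion twice in a row finds exactly the same words.
single = re.findall(r"\b\w+\b", text)
doubled = re.findall(r"\b\b\w+\b\b", text)

print(boundary_span)
print(single == doubled)
```

Since the position does not advance after a successful \b, a second \b at the same spot is checked against the same position and succeeds too.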


Why is the wording "should never"? Repeating a zero-width assertion is 
not forbidden, for instance:

 >>> import re
 >>> re.compile("\\b\\b\w+\\b\\b")
<_sre.SRE_Pattern object at 0xb7831140>
 >>>

Nevertheless, the following fails to compile:

 >>> re.compile("\\b{2}\w+\\b\\b")
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/python2.7/re.py", line 190, in compile
     return _compile(pattern, flags)
   File "/usr/lib/python2.7/re.py", line 245, in _compile
     raise error, v # invalid expression
sre_constants.error: nothing to repeat
 >>>


So aren't \\b\\b and \\b{2} equivalent?
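The difference can be checked without reading a traceback. The following is a small sketch, assuming Python 3, where the re module raises re.error for a pattern it rejects. The helper name compiles is hypothetical, not part of the re module.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

written_twice = compiles(r"\b\b\w+")   # the assertion written out twice
quantified = compiles(r"\b{2}\w+")     # the assertion followed by a quantifier

print(written_twice, quantified)
```

On the CPython versions I am aware of, the first pattern is accepted and the second is rejected, so the two spellings are treated differently at compile time even though they would describe the same matches.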


Surprisingly, the engine doesn't seem to optimize away repeated boundary 
assertions, at least when compiling. For instance:

# ------------------------------
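import re
import time

# Baseline: only build the pattern string, without compiling it.
a=time.clock()
len("\\b\\b\\b"*100000+"\w+")
b=time.clock()
print "CPU time : %.2f s" %(b - a)

# Compile the same pattern, which repeats \b 300000 times.
a=time.clock()
re.compile("\\b\\b\\b"*100000+"\w+")
b=time.clock()
print "CPU time : %.2f s" %(b - a)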
# ------------------------------

outputs:

# ------------------------------
CPU time : 0.00 s
CPU time : 1.33 s
# ------------------------------
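For anyone who wants to repeat the measurement on Python 3, where time.clock() is no longer available, here is a rough sketch using time.process_time() instead. The helper time_compile and the smaller repeat count n are my own assumptions, picked to keep the run short; the figures it prints will differ from the ones above.

```python
import re
import time

def time_compile(pattern):
    """Return the CPU seconds spent compiling pattern."""
    re.purge()  # empty the re module's cache so the pattern is really compiled
    start = time.process_time()
    re.compile(pattern)
    return time.process_time() - start

n = 1000  # hypothetical repeat count, far smaller than the 100000 used above

plain = time_compile(r"\w+")
repeated = time_compile(r"\b\b\b" * n + r"\w+")

print("plain    : %.4f s" % plain)
print("repeated : %.4f s" % repeated)
```

Comparing the two figures for several values of n gives an idea of how the compile time grows with the number of repeated assertions.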


Your comments are welcome!


