Regexp speed (was Re: split hangs)

Sat Dec 30 23:51:18 EST 2000

Try reading the suggested "Mastering Regular Expressions" book by Jefferey Freidl (published by O'Reilly I think) There are 2 main types of regex engines: NFA and DFA.  I forget exactly what they stand for, but "N" is for non-deterministic and "D" is for deterministic.

NFA's can be faster and more flexible (like allowing backreferences)
but can be slow if the regex isn't crafted right (see your example).
A DFA, however, is faster if there is a lot of text and there is no
match.  

Python (and Perl, etc) use NFA's.  NFA's start with the given text
(your file) and starts to match it.  It starts with the first
character and works 1 character at a time until a char doesn't match.
Then it backs up until it can match that character or it determines no
match.  If it finds no match it will move ahead 1 character (from the
first one in this attempt) and try again.  If you have no match at all
in a long document (especially if you use repetition in the regex)
this will be very slow.  (Also note that repetition is usually greedy,
ie the regex "/\*.*\*/" to match C comments will match your entire
file from the first /* to the last */.

DFA's (it looks like emacs uses one) instead keeps track of all
possible ways of matching the current text simultaneously.  This means
that if there is a match, it is slower than an NFA with the properly
crafted regex.  Also, DFA's don't allow backreferences.  If there is
no match in a large piece of text the DFA will be faster.

I would strongly suggest reading Jefferey Friedl's book.  I have it
(though I haven't read the entire book) and found it to be very useful
and enlightening.  He will explain in much more completely and
accurately the operation of NFA and DFA regex engines and also how to
craft regexes that are correct and efficient.

HTH,
-D

On Fri, Dec 29, 2000 at 04:48:07PM +0100, davide at mercurio.localdomain wrote:
> In comp.lang.python, you wrote:
> 
> >The tail end of the output is clear:  adding another character to the string
> >doubles the time it takes to figure out that the regexp doesn't match.  This
> >isn't a bug, it's the way Python (and Perl, and Emacs, and ...) regexps
> >work.  A full explanation is quite involved; you can find one in Jonathan
> >Friedl's excellent book "Mastering Regular Expressions" (O'Reilly).
> 
> I tried the equivalent regexp in emacs
>   <\([^<>"]+\("[^"]*"\)?\)+>
> with bigger and bigger files and it does not seem to give an exponential
> behaviour.
> 
> In everything I read regexp matching is given as a linear time problem in
> n + m (where n is the size of the testing string and m the size of the
> regular expression).

How do you measure the size of a regex?  Is ".*" inifinite or
2?

> 
> I am a python newbye and I could not find the reason for this exp 
> behaviour. The regular expressions seems pretty standard to me, 
> and so there should be no need to go beyond regular languages and DFA. 
> 
> Would you explain why this algorithm is needed in python?
> Do really the other implementation (grep, perl, emacs, ecc...)
> use the same algorithm?
> 
> Thanks,
> 
> 				Davide