re.search much slower then grep on some regular expressions

Wed Jul 9 15:46:50 EDT 2008

samwyse wrote:
> On Jul 8, 11:01 am, Kris Kennaway <k... at FreeBSD.org> wrote:
>> samwyse wrote:
> 
>>> You might want to look at Plex.
>>> http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
>>> "Another advantage of Plex is that it compiles all of the regular
>>> expressions into a single DFA. Once that's done, the input can be
>>> processed in a time proportional to the number of characters to be
>>> scanned, and independent of the number or complexity of the regular
>>> expressions. Python's existing regular expression matchers do not have
>>> this property. "
> 
>> Hmm, unfortunately it's still orders of magnitude slower than grep in my
>> own application that involves matching lots of strings and regexps
>> against large files (I killed it after 400 seconds, compared to 1.5 for
>> grep), and that's leaving aside the much longer compilation time (over a
>> minute).  If the matching was fast then I could possibly pickle the
>> lexer though (but it's not).
> 
> That's funny, the compilation is almost instantaneous for me.

My lexicon was quite a bit bigger, containing about 150 strings and regexps.

> However, I just tested it to several files, the first containing
> 4875*'a', the rest each twice the size of the previous.  And you're
> right, for each doubling of the file size, the match take four times
> as long, meaning O(n^2).  156000*'a' would probably take 8 hours.
> Here are my results:

The docs say it is supposed to be linear in the file size ;-) ;-(

Kris