Speeding up multiple regex matches

Talin viridia at gmail.com
Fri Nov 18 11:57:30 EST 2005


I've run into this problem a couple of times. Say I have a piece of
text that I want to test against a large number of regular expressions,
where a different action is taken based on which regex successfully
matched. The naive approach is to loop through each regex and stop
when one succeeds. However, I am finding this too slow for my
application -- currently 30% of the run time is being spent in the
regex matching.
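
Schematically, the naive version looks like this (the pattern table
and the actions are made up for illustration):

import re

rules = [
    (re.compile(r'\d+'),       lambda m: ('number', m.group())),
    (re.compile(r'[A-Za-z]+'), lambda m: ('word', m.group())),
    (re.compile(r'\s+'),       lambda m: ('space', m.group())),
]

def dispatch(text):
    # Try each regex in priority order; the first match wins.
    for pattern, action in rules:
        m = pattern.match(text)
        if m:
            return action(m)
    return None

# dispatch('42 apples') -> ('number', '42')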

I thought of a few approaches, but I am not sure how to make them
work:

1) Combine all of the regular expressions into one massive regex and
let the regex state machine do all the discriminating. The problem with
this is that I don't see an obvious way to determine which regex was
the matching one (though maybe named groups could address that; see the
first sketch after this list).

2) Use the first character of the text string to cut down the search
size. The problem here is that since I don't know the regex patterns in
advance, I would need to inspect each regex and determine the possible
set of starting characters it could match. Doing that properly would
require writing my own regex parser (a crude heuristic version is
sketched after this list).

3) Use a lexical scanner. This is probably overkill for what I want to
do.

4) Write my own regex system that does what I want. This is more work
than I want to do.
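
Regarding approach 1: on reflection, named groups may solve the
which-regex-matched problem, assuming the individual patterns contain
no backreferences or named groups of their own -- match.lastgroup
reports the name of the alternative that matched. A rough sketch (the
patterns and names here are just placeholders):

import re

# Pattern list in priority order; alternatives are tried left to
# right, so list order still encodes precedence.
parts = [
    ('number', r'\d+'),
    ('word',   r'[A-Za-z]+'),
    ('space',  r'\s+'),
]
combined = re.compile('|'.join('(?P<%s>%s)' % (name, pat)
                               for name, pat in parts))

def which(text):
    m = combined.match(text)
    # lastgroup is the name of the alternative that matched.
    return (m.lastgroup, m.group()) if m else None

# which('hello world') -> ('word', 'hello')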
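
Regarding approach 2: a crude heuristic might get partway there
without a full regex parser -- bucket only the patterns that begin
with a plain literal character, and always try the rest. Something
like the sketch below; note that it ignores compile flags such as
re.IGNORECASE and treats any escaped or special leading character as
unclassifiable:

import re

SPECIAL = set('.^$*+?{}[]()|\\')

def build_index(compiled):
    # compiled: list of compiled regex objects, in priority order.
    # Patterns opening with a plain literal character get bucketed
    # by that character; everything else lands in the always-tried
    # fallback list.
    buckets, fallback = {}, []
    for i, pat in enumerate(compiled):
        first = pat.pattern[:1]
        if first and first not in SPECIAL:
            buckets.setdefault(first, []).append((i, pat))
        else:
            fallback.append((i, pat))
    return buckets, fallback

def match_one(text, buckets, fallback):
    # Merge bucketed and fallback candidates, sorting by original
    # index so the priority order of the full list is preserved.
    candidates = sorted(buckets.get(text[:1], []) + fallback,
                        key=lambda t: t[0])
    for i, pat in candidates:
        m = pat.match(text)
        if m:
            return i, m
    return None, None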

Any thoughts?



