Regex Speed

John Machin sjmachin at lexicon.net
Tue Feb 20 18:15:18 EST 2007


On Feb 21, 8:29 am, garri... at gmail.com wrote:
> While creating a log parser for fairly large logs, we have run into an
> issue where the time to process was relatively unacceptable (upwards
> of 5 minutes for 1-2 million lines of logs). In contrast, using the
> Linux tool grep would complete the same search in a matter of seconds.
>
> The search we used was a regex of 6 elements "or"ed together, with an
> exclusionary set of ~3 elements.

What is an "exclusionary set"? It would help enormously if you were to
tell us what the regex actually is. Feel free to obfuscate any
proprietary constant strings, of course.

> Due to the size of the files, we
> decided to run these line by line,

I presume you mean you didn't read the whole file into memory;
correct? 2 million lines doesn't sound like much to me; what is the
average line length and what is the spec for the machine you are
running it on?
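
For reference, processing a file line by line without slurping it whole normally looks like the loop below. The pattern is a made-up stand-in (six alternatives or'ed together), since you haven't shown us your actual regex:

```python
import re

# Hypothetical pattern standing in for the poster's six or'ed elements.
pattern = re.compile(r"ERROR|WARN|FATAL|TIMEOUT|REFUSED|DENIED")

def count_matches(path):
    """Scan a log file line by line without reading it all into memory."""
    hits = 0
    with open(path) as f:
        for line in f:          # the file object iterates lazily, one line at a time
            if pattern.search(line):
                hits += 1
    return hits
```

If a loop like this is taking five minutes over 1-2 million lines, the pattern itself is the first suspect.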

> and due to the need of regex
> expressions, we could not use more traditional string find methods.
>
> We did pre-compile the regular expressions, and attempted tricks such
> as map to remove as much overhead as possible.

map is a built-in function, not a trick. What "tricks"?
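
For what it's worth: if your six elements really are independent alternatives, testing one combined compiled alternation per line is usually faster than testing six separate patterns per line, because each line gets scanned once instead of up to six times. A quick sketch, with made-up elements:

```python
import re

line = "2007-02-20 10:15:18 connection REFUSED by host"

# Six separate compiled patterns (made-up examples) ...
parts = ["ERROR", "WARN", "FATAL", "TIMEOUT", "REFUSED", "DENIED"]
separate = [re.compile(p) for p in parts]

# ... versus one combined alternation, compiled once.
combined = re.compile("|".join(parts))

# Both report the same thing, but the combined pattern scans the line once.
assert any(p.search(line) for p in separate) == bool(combined.search(line))
```

Whether that helps in your case depends on what your real pattern looks like, which is why seeing it matters.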

>
> With the known limitations of not being able to slurp the entire log
> file into memory, and the need to use regular expressions, do you have
> any ideas on how we might speed this up without resorting to system
> calls (our current "solution")?

What system calls? Do you mean running grep as a subprocess?
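
If so, the usual idiom is the subprocess module with an argument list rather than a shell string. A sketch (using the modern subprocess.run interface, with a made-up pattern and path):

```python
import subprocess

def grep_lines(pattern, path):
    """Run grep -E as a subprocess and return the matching lines."""
    result = subprocess.run(
        ["grep", "-E", pattern, path],
        capture_output=True, text=True,
    )
    # grep exits 0 on a match, 1 on no match, 2 on error;
    # with no check=True, a non-zero exit just leaves stdout empty.
    return result.stdout.splitlines()
```

But shelling out to grep is a workaround, not a diagnosis; it would still help to see what the pure-Python version is actually doing.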

To help you, we need either (a) basic information or (b) crystal
balls. Is it possible for you to copy & paste your code into a web
browser or e-mail/news client? Telling us which version of Python you
are running might be a good idea too.

Cheers,
John

More information about the Python-list mailing list