Regex Speed

garrickp at gmail.com
Tue Feb 20 19:40:40 EST 2007


On Feb 20, 4:15 pm, "John Machin" <sjmac... at lexicon.net> wrote:

> What is an "exclusionary set"? It would help enormously if you were to
> tell us what the regex actually is. Feel free to obfuscate any
> proprietary constant strings, of course.

My apologies. I don't have specifics right now, but it's something
along the line of this:

import re

error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
exclusion_list = re.compile(r"No Errors Found|Premature EOF, stopping translate")

for test_text in test_file:
    # search() scans the whole line; match() would only test its start
    if error_list.search(test_text) and not exclusion_list.search(test_text):
        pass  # Process test_text

Yes, I know, these hardly count as regular expressions, but the requirements for
the script specified that the error list be capable of accepting
regular expressions, since these lists are configurable.
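A minimal sketch of such a configurable filter, assuming the patterns arrive as plain lists of strings and that a hit anywhere in the line should count, which is why it calls search() (match() only tests the start of the line). The names build_pattern and is_reportable are hypothetical, and the pattern strings are the placeholders quoted above, not the real configuration:

```python
import re

def build_pattern(alternatives):
    """Join configurable regex fragments into one compiled alternation."""
    # Wrap each fragment in a non-capturing group so any '|' inside it stays local.
    return re.compile("|".join("(?:%s)" % alt for alt in alternatives))

# Placeholder lists; in the real script these would come from configuration.
error_list = build_pattern(["error", "miss", "issing", "inval", "nvalid", "math"])
exclusion_list = build_pattern(
    ["No Errors Found", "Premature EOF, stopping translate"]
)

def is_reportable(line):
    """True if the line hits the error list and misses the exclusion list."""
    return bool(error_list.search(line)) and not exclusion_list.search(line)
```

Compiling each list once, outside the loop, keeps the per-line work down to two search() calls.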

> I presume you mean you didn't read the whole file into memory;
> correct? 2 million lines doesn't sound like much to me; what is the
> average line length and what is the spec for the machine you are
> running it on?

You are correct. The individual files can be anywhere from a few bytes
to 2 GB. The average is around 1 GB, and there are a number of
files to be iterated over (an average of 4). I do not know the machine
specs, though I can safely say it is a single-core machine, sub
2.5 GHz, with 2 GB of RAM, running Linux.
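For files of that size, one common approach is to iterate over the file object so that only one line is held in memory at a time. A sketch under that assumption, written for a current Python 3; iter_lines is a hypothetical helper, and the errors="replace" handling is a guess rather than what the real script does:

```python
def iter_lines(paths):
    """Yield lines from each file in turn without loading a whole file."""
    for path in paths:
        # errors="replace" is an assumption so that stray bytes do not abort the scan.
        with open(path, "r", errors="replace") as handle:
            for line in handle:  # the file object reads lazily, line by line
                yield line
```

Because this is a generator, the caller can feed it straight into a filtering loop and memory use stays flat regardless of file size.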

> map is a built-in function, not a trick. What "tricks"?

I'm using the term "tricks" for cases where I may be obfuscating the code in an
effort to make it run faster. In the case of map, that means getting rid of the
interpreted for-loop overhead in favor of the implied C loop offered
by map.
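The two styles can be sketched side by side as follows. count_hits_loop and count_hits_map are hypothetical names and the pattern is a placeholder. In Python 3, map returns a lazy iterator, so something (here sum) has to consume it. Whether the map version is actually faster is something to measure rather than assume:

```python
import re

pattern = re.compile(r"error|inval")  # placeholder pattern, not the real list

def count_hits_loop(lines):
    """Explicit for loop: the interpreter executes the loop body per line."""
    hits = 0
    for line in lines:
        if pattern.search(line):
            hits += 1
    return hits

def count_hits_map(lines):
    """Same count, with the per-line calls driven by map instead of a for loop."""
    # pattern.search yields a match object or None; bool() turns that into
    # True or False, and sum() counts the True values.
    return sum(map(bool, map(pattern.search, lines)))
```

Both functions return the same count for the same input, so they can be timed against each other on a sample file.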

> What system calls? Do you mean running grep as a subprocess?

Yes. While this may not seem evil in and of itself, we are trying to
get our company to adopt Python into more widespread use. I'm guessing
the limiting factor isn't Python itself, but us Python newbies missing an
obvious way to speed up the process.

> To help you, we need either (a) basic information or (b) crystal
> balls. Is it possible for you to copy & paste your code into a web
> browser or e-mail/news client? Telling us which version of Python you
> are running might be a good idea too.

Can't copy and paste code (corp policy and all that), no crystal balls
for sale, though I hope the above information helps. Also, running a
trace on the program indicated that Python was spending a lot of time
looping over lines, checking each alternative of the expression in
sequence.

And Python 2.5.2.

Thanks!




