Regex Speed

John Machin sjmachin at lexicon.net
Wed Feb 21 01:45:16 EST 2007


On Feb 21, 11:40 am, garri... at gmail.com wrote:
> On Feb 20, 4:15 pm, "John Machin" <sjmac... at lexicon.net> wrote:
>
> > What is an "exclusionary set"? It would help enormously if you were to
> > tell us what the regex actually is. Feel free to obfuscate any
> > proprietary constant strings, of course.
>
> My apologies. I don't have specifics right now, but it's something
> along the line of this:
>
> error_list = re.compile(r"error|miss|issing|inval|nvalid|math")
> exclusion_list = re.complie(r"No Errors Found|Premature EOF, stopping
> translate")
>
> for test_text in test_file:
>     if error_list.match(test_text) and not
> exclusion_list.match(test_text):
>         #Process test_text
>
> Yes, I know, these are not re expressions, but the requirements for
> the script specified that the error list be capable of accepting
> regular expressions, since these lists are configurable.

You could do a quick check on the list of patterns; if they are simple
strings (no ()+*?[] etc), then you could use Danny Yoo's ahocorasick
module (from http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/):

>>> import ahocorasick as ac
>>> gadget = ac.KeywordTree()
>>> for text in "error|miss|issing|inval|nvalid|math".split('|'):
...    gadget.add(text)
...
>>> gadget.make()
>>> def showall(machine, line):
...    startpos = 0
...    while 1:
...       result = machine.search(line, startpos)
...       if not result: return
...       beg, end = result
...       print beg, end, repr(line[beg:end])
...       startpos = beg + 1
...
>>> showall(gadget, 'the hissing misses invalidated erroneous mathematics')
5 11 'issing'
12 16 'miss'
19 24 'inval'
20 26 'nvalid'
41 45 'math'
>>> showall(gadget, 'the hissing misses invalidated terror mathematics')
5 11 'issing'
12 16 'miss'
19 24 'inval'
20 26 'nvalid'
32 37 'error'
38 42 'math'
>>> showall(gadget, 'these are not the droids that you are looking for')
>>>

But of course you would just use a simple search() per line -- I'm
just showing you that it works :-)

>
> > What system calls? Do you mean running grep as a subprocess?
>
> Yes. While this may not seem evil in and of itself, we are trying to
> get our company to adopt Python into more widespread use. I'm guessing
> the limiting factor isn't python, but us python newbies missing an
> obvious way to speed up the process.

You (can't/don't have to) write everything in Python or any other
single language. One of Pythons's very good points is that you can
connect it to anything -- or if you can't, just ask in this
newsgroup :-) Look, boss, yesterday we got it running grep; today
we're calling a module written in C; not bad for a bunch of newbies,
eh?

> And python 2.5.2.

2.5.2? Who needs crystal balls when you've got a time machine? Or did
you mean 2.5? Or 1.5.2 -- say it ain't so, Joe!

Cheers,
John





More information about the Python-list mailing list