Most efficient method to search text?

Michael Hudson mwh at python.net
Wed Oct 16 11:19:58 EDT 2002


Jeff Epler <jepler at unpythonic.net> writes:

> On Wed, Oct 16, 2002 at 01:35:09PM +0000, Michael Hudson wrote:
> > Here's a way to quickly (I hope! Haven't done any benchmarks) tell if
> > one of a bunch of words is contained in a chunk of text, assuming the
> > words are known beforehand [...]
> 
> Is there any reason to suppose that this is more efficient than using
> re.compile("|".join(re.escape(words))).match?  

No, and it's not (once you fiddle what you said to 

    re.compile("|".join(
        ['\\b'+re.escape(word)+'\\b' for word in words])).search

so it does roughly the same thing).

> I haven't looked at the implementation of sre, but it should be able
> to generate a simple DFA for this RE and execute something very much
> like your 'match' function, but at C speeds.

Yup.  Dunno if it does either, but it's quick.

> Having one dict lookup per character scanned seems like a pretty huge
> chunk of overhead.

The original reason I started writing code like this was that I needed
to feed in the string a character at a time and there was enough else
going on to easily swamp one measly dict lookup.  Another difference
is that it's easy to fiddle my scheme into telling you which word
matched (this isn't exactly hard with re either).

I can't think of many ways of speeding it up in Python -- perhaps
running the text through map(ord, text) and using lists?  Doesn't seem
to make much difference.  psyco seems to make more difference, and I
guess playing for psyco a bit could get more (these are all eyeball
tests, no numbers I'm afraid).

Anyway, what I thought was a neat bit of code was the point...

Cheers,
M.

-- 
  Whaaat? That is the most retarded thing I have seen since, oh,
  yesterday                             -- Kaz Kylheku, comp.lang.lisp



More information about the Python-list mailing list