How to check if any item from a list of strings is in a big string?

Scott David Daniels Scott.Daniels at Acm.Org
Tue Jul 14 19:00:03 EDT 2009


Nobody wrote:
> On Tue, 14 Jul 2009 02:06:04 -0300, Gabriel Genellina wrote:
> 
>>> Matt, how many words are you looking for, in how long a string ?
>>> Were you able to time any( substr in long_string ) against re.compile
>>> ( "|".join( list_items )) ?
>> There is a known algorithm to solve specifically this problem  
>> (Aho-Corasick), a good implementation should perform better than R.E. (and  
>> better than the gen.expr. with the advantage of returning WHICH string  
>> matched)
> 
> Aho-Corasick has the advantage of being linear in the length of the
> patterns, so the setup may be faster than re.compile(). The actual
> searching won't necessarily be any faster (assuming optimal
> implementations; I don't know how safe that assumption is).
> 
Having done a fast Aho-Corasick implementation myself, I can assure you
that the actual searching can be incredibly fast.  RE conversion usually
goes to a slightly more general machine than the Aho-Corasick processing
requires.

--Scott David Daniels
Scott.Daniels at Acm.Org




More information about the Python-list mailing list