Efficient way of testing for substring being one of a set?

Jeff jeffober at gmail.com
Thu Apr 3 08:59:33 EDT 2008


On Apr 3, 8:19 am, George Sakkis <george.sak... at gmail.com> wrote:
> On Apr 3, 8:03 am, Jeff <jeffo... at gmail.com> wrote:
>
> > def foo(sample, strings):
> >         for s in strings:
> >                 if sample in s:
> >                         return True
> >         return False
>
> > This was an order of magnitude faster for me than using str.find or
> > str.index.  That was finding rare words in the entire word-list (w/
> > duplicates) of War and Peace.
>
> If you test against the same substrings over and over again, an
> alternative would be to build a regular expression:
>
> import re
> search = re.compile('|'.join(re.escape(x)
>                              for x in substrings)).search
> p = search(somestring)
> if p is not None:
>   print 'Found', p.group()
>
> George

That would be an enormous regular expression and eat a lot of memory.
But over an enormous number of substrings, it would be O(log n),
rather than O(n).



More information about the Python-list mailing list