Most efficient method to search text?

Bengt Richter bokr at oz.net
Thu Oct 17 18:25:08 EDT 2002


On Thu, 17 Oct 2002 11:14:09 GMT, Michael Hudson <mwh at python.net> wrote:

>Tim Peters <tim.one at comcast.net> writes:
>
>> Especially for purposes of building lexers, it might be useful if the re
>> package could recognize when a DFA approach was sufficient and practical,
>> and switch to a different scheme entirely then.  Or it might not.  Build
>> code to try it both ways, and let us know how it turns out ...
>
>Indeed, my code canes re when the wordlist gets long.  Here's some

If this is easy to add to your test harness, I'd be interested to see how
the search below does in comparison on the longer word lists (it's probably
faster than the has_word.py I posted elsewhere, depending on the relative
lengths of word lists and strings, and on internal vs. external looping and
allocation). A toy stand-in for that comparison follows the listing.

====================

# __PyPAN_Py word_is_in.py -- Check if at least one word in list is in string.
def word_is_in(
    wordlist,   # ['word', 'word2', 'etc']
    aString,    # string to find words in
    trans = ''.join([c.isalnum() and c or ' ' for c in map(chr,range(256))])
):
    """
    Search for any of a list of words in a string and return 1 if any found, else 0.
    """
    word_dict = dict(zip(wordlist,wordlist))    # only the keys matter: O(1) membership tests
    for w in aString.translate(trans).split():  # non-alphanumerics -> spaces, then split
        if w in word_dict: return 1
    return 0
# __PyPAN__

====================
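Here's a toy stand-in for such a comparison; the data is made up and
toy_words/compare_search are just names I picked, since I don't have your
do_comp harness here. Untested sketch, in the same style as above:

import re, time, random, string

def toy_words(n):
    # n random 10-letter lowercase "words"
    return [''.join([random.choice(string.ascii_lowercase) for i in range(10)])
            for j in range(n)]

def compare_search(n, text):
    words = toy_words(n)
    t0 = time.clock()
    word_is_in(words, text)
    t1 = time.clock()
    # re approach: alternation of all the words; compile time included
    rx = re.compile('|'.join(map(re.escape, words)))
    rx.search(text)
    t2 = time.clock()
    print 'n=%5d  word_is_in: %.4fs  re: %.4fs' % (n, t1-t0, t2-t1)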
The trans table treats anything non-alphanumeric as a word delimiter. You could mess with that.
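E.g., at the interactive prompt (the words and test string are just made up
to show the effect):

>>> trans = ''.join([c.isalnum() and c or ' ' for c in map(chr,range(256))])
>>> 'spam, eggs; ham!'.translate(trans)
'spam  eggs  ham '
>>> word_is_in(['eggs', 'toast'], 'spam, eggs; ham!')
1
>>> word_is_in(['toast'], 'spam, eggs; ham!')
0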

You could write a fast C function that avoids actually doing the translate and split,
and instead just scans for the successive tokens and checks each one against the
word_dict. For searching large files, that would help with memory use as well as speed.
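
Just to make that idea concrete, here's a pure-Python model of what such a
scanner would do (word_is_in2 is my name for it; in Python the char-by-char
loop is slow, so the payoff would only show up in C):

def word_is_in2(wordlist, aString):
    # same result as word_is_in, but finds tokens in place instead of
    # building a translated copy plus a list of all the pieces
    word_dict = dict(zip(wordlist, wordlist))
    i, n = 0, len(aString)
    while i < n:
        while i < n and not aString[i].isalnum():   # skip delimiters
            i = i + 1
        j = i
        while j < n and aString[j].isalnum():       # scan one token
            j = j + 1
        if j > i and aString[i:j] in word_dict:
            return 1
        i = j
    return 0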

BTW, a split method for strings that does a pre-translate could be useful.
You could specify the character set by passing a filter function like isalnum,
used much as the trans argument is built above, but a builtin
str.splitcs(character_set_filter) could recognize references to the standard
string methods (e.g., isalnum) and use cached translation info instead of
rebuilding it each time. Alternatively, a plain string argument could be taken
to mean '['+the_string+']+' as a findall regex.
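
Something like this as a plain function, say (splitcs and the table cache
are hypothetical; there is no such string method, this is just the idea in
the same style as above):

import re

_table_cache = {}   # a builtin could precompute these for isalnum & friends

def splitcs(aString, cs):
    if callable(cs):
        # character-filter form, e.g. splitcs(s, str.isalnum)
        try:
            trans = _table_cache[cs]
        except KeyError:
            trans = ''.join([cs(c) and c or ' ' for c in map(chr, range(256))])
            _table_cache[cs] = trans
        return aString.translate(trans).split()
    else:
        # string form, taken as a character set: '['+cs+']+' findall regex
        return re.findall('[' + cs + ']+', aString)

So splitcs(aString, str.isalnum) would act like the translate/split above,
and splitcs(aString, 'a-zA-Z0-9') like re.findall('[a-zA-Z0-9]+', aString).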


>numbers (do_comp(n) builds a list of n words composed of ten random
>ascii characters and searches the Python standard library for them
>using first my code, then a re-based approach):
>
>>>> import robin
>>>> robin.do_comp(10)
>2.275832057
>0.317215919495
>>>> robin.do_comp(100)
>2.2694439888
>0.980538964272
>>>> robin.do_comp(1000)
>2.36677598953
>8.53031408787
>>>> robin.do_comp(2000)
>2.34253299236
>16.9222760201
>
>So roughly linear behaviour from re, n-independent behaviour from my
>code.  Isn't it nice when things do what you expect?
>
>On the other hand, the "compiler" code I posted yesterday consumes
>vast gobs of memory -- trying to compile a 10000 long list of random
>words got the OOM killer into action.  A version using lists seems
>slightly better in this respect, though slower in execution.
>

Regards,
Bengt Richter


