Most efficient method to search text?

Tim Peters tim.one at comcast.net
Fri Oct 18 15:59:23 EDT 2002


[Michael Hudson, searching for words]
> ...
> Indeed, my code canes re when the wordlist gets long.  Here's some
> numbers ...
> ...
> So roughly linear behaviour from re, n-independent behaviour from my
> code.  Isn't it nice when things do what you expect?

When you're young, it can seem that way.  At my age, it mostly makes one
nervous, waiting for the other and always larger and much smellier foot to
fall <wink>.

> On the other hand, the "compiler" code I posted yesterday consumes
> vast gobs of memory -- trying to compile a 10000 long list of random
> words got the OOM killer into action.  A version using lists seems
> slightly better in this respect, though slower in execution.

Which is, of course, the traditional tradeoff between NFA and DFA
approaches -- with a DFA you can pay in memory and preprocessing time,
in bad cases exponentially so due to state explosion, while with an
NFA you pay at runtime instead.
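
For concreteness, here's the classic bad case (my illustration, nothing
from Michael's code):  over the alphabet {a, b}, the pattern
(a|b)*a(a|b){n-1} matches exactly the strings whose n-th character from
the end is an 'a'.  An NFA with about n states suffices, but any DFA
has to remember the last n characters seen, so it needs at least 2**n
states.  Python's re takes the NFA/backtracking side of that bargain,
so compiling stays cheap regardless:

import re

n = 20
# Compiling is fast and small, although the minimal equivalent DFA
# would need at least 2**20 states.
pat = re.compile('(a|b)*a(a|b){%d}' % (n - 1))
print(bool(pat.search('b' * 100 + 'a' + 'b' * (n - 1))))  # True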

I see Bengt discovered how to use dicts for this (a trie, in effect --
a sketch is below), and that's probably what you'll use in the future
too <wink>.  For the more general case of lexing, you're searching for
patterns (like r'[A-Za-z_]\w*'), and twisting a dict to do that gets at
best obscure.  There's a large literature now on compromise approaches,
building only the parts of the DFA that the target string actually
needs to explore, in a dynamic way.  This sounds horribly expensive at
first glance (don't think about that:  mixing eyes with ears doesn't
make sense), but in some practical cases there are exceedingly clever
ways to "fake it" with a handful of machine-level instructions per
character, using integers as bit vectors representing the *set* of NFA
nodes currently active -- there's a sketch of that below too.
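
Here's a bare-bones sketch of the nested-dict idea -- my
reconstruction, not Bengt's actual code, and all the names are made
up.  Each dict maps a character to the dict for the next position, and
a None key marks (and remembers) a complete stored word:

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word       # terminal marker, remembering the word
    return root

def find_words(trie, text):
    """Return a list of (offset, word) pairs for every hit in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        i = start
        while node is not None:
            if None in node:            # a stored word ends here
                hits.append((start, node[None]))
            if i == len(text):
                break
            node = node.get(text[i])    # None if no child: dead end
            i += 1
    return hits

trie = build_trie(['spam', 'span', 'eggs'])
print(find_words(trie, 'lovely spam and eggs'))
# [(7, 'spam'), (16, 'eggs')]

The scan does work bounded by len(text) times the length of the
longest word, no matter how many words are stored -- which is where
the n-independent behaviour comes from.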
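
And a sketch of the bit-vector trick -- again mine, not anything
posted in this thread.  It's the "Shift-And" scheme of Baeza-Yates and
Gonnet, with an unbounded Python int standing in for a machine word,
one bit per NFA state:

def shift_and_search(pattern, text):
    """Return the offset of the first occurrence of pattern in text,
    or -1.  Assumes pattern is non-empty."""
    m = len(pattern)
    # B[c] has bit j set iff pattern[j] == c.
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    D = 0                       # bit vector of active NFA states
    accept = 1 << (m - 1)       # bit for the final state
    for i, c in enumerate(text):
        # Advance every active state one position, start a fresh
        # attempt at state 0, and kill states that don't match c.
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & accept:
            return i - m + 1    # offset where the match starts
    return -1

print(shift_and_search('spam', 'lovely spam and eggs'))  # -> 7

In C, D lives in a register, and for patterns no longer than the word
size the inner loop really is just a handful of instructions per
character.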

Gonzo speeding of all arguably common cases of pattern-matching is a
lifetime job.  Although there's nothing unique about pattern-matching in
that ...

learning-to-live-with-fast-enough-leaves-more-time-for-life-ly y'rs  - tim
