Does my RE rock, or suck?

Fri Jul 9 08:12:00 EDT 2004

Bryan Olson <fakeaddress at nowhere.org> writes:

> Thanks Michael.  I still have more investigating to do.
> 
> Michael Hudson wrote:
>  > Well, the re module is an NFA engine, not a DFA engine, so I'd expect
>  > this re to run in time proportional to the number of triggers.
> 
> I guess I didn't point this out except for my example: I tried
> to make the RE efficient even for a search-and-backtrack engine.
> Not only are there no loop-like things, there are never multiple
> choices that begin with the same character.  Any time it
> backtracks, it will fail immediately at every choice point.

Hmm, you sound like you know more about this sort of thing than I do
:-)

> In my example I compile the RE:
> 
>      '^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'
> 
> If the RE compiler is smart, searching should involve scanning
> the longest prefix that could still match, plus looking at just
> one character for each branch-point along the way.  (If it uses
> a jump-table at decision points, it should be truly linear.)

Well, I don't know much about the re compiler.  It's written in
Python, though, so it shouldn't be *that* hard to check for
yourself...

>  > "Dozens to hundreds" sounds like a fairly small number in context,
> 
> I agree, though hundreds would make the longest RE string I've
> ever used.
> 
>  > but: time it!  Only you know what fast enough is.
> 
> It seems fast.  Testing doesn't allay my fears of a bad worst-
> case that an adversary might discover.
>  > Here's a cute approach that should have running time independent
>  > of the number of triggers:
> [snip]
> 
> I've seen that kind of structure called a "trie" (Knuth cites
> the name 'trie' to E. Fredkin, CACM 3; 1960).  When I generate
> my RE, I sort of use the same trie structure.  

Heh, I'd sort of heard of tries but didn't realize I was implementing
one.  I just thought it was a simple minded DFA engine...

> My (trivial) tests so far indicate that the RE form is faster in
> typical cases.  Your trie-searcher is nice in that I have much more
> confidence in the worst-case run-time.

Well, the re engine is written in C, so you would hope that it beats
my Python engine...

The reason I original wrote code somewhat like what I put in my first
post is that I needed character-at-a-time processing.

>  > Here's a less cute but probably saner approach:
> [...]
>  > this should run in time proportional to the number of different
>  > lengths of trigger (and use rather less memory than my first version).
> 
> I'm not convinced.  The time to hash a string is proportional to
> the length, so I think searches can take time proportional to
> the sum of all the trigger lengths.

I guess, but the string hashing is done in C and is "probably"
insignificant.

> [I, Bryan asked:]
>  >>     Am I likely to exceed recursion depth?
>  >
>  > Not in 2.4 :-)  And I doubt it before that, actually.
> 
> But Python is only up to ... oh, the smiley.  I get it.

Well, my point was there is no recursion limit in Python 2.4's sre
engine.  But from my limited understanding, I don't think this type of
regular expression causes any version of sre to recurse.

Cheers,
mwh

-- 
  Need to Know is usually an interesting UK digest of things that
  happened last week or might happen next week. [...] This week,
  nothing happened, and we don't care.
                           -- NTK Now, 2000-12-29, http://www.ntk.net/