Does my RE rock, or suck?
Michael Hudson
mwh at python.net
Fri Jul 9 08:12:00 EDT 2004
Bryan Olson <fakeaddress at nowhere.org> writes:
> Thanks Michael. I still have more investigating to do.
>
> Michael Hudson wrote:
> > Well, the re module is an NFA engine, not a DFA engine, so I'd expect
> > this re to run in time proportional to the number of triggers.
>
> I guess I didn't point this out except for my example: I tried
> to make the RE efficient even for a search-and-backtrack engine.
> Not only are there no loop-like things, there are never multiple
> choices that begin with the same character. Any time it
> backtracks, it will fail immediately at every choice point.
Hmm, you sound like you know more about this sort of thing than I do
:-)
> In my example I compile the RE:
>
> '^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'
>
> If the RE compiler is smart, searching should involve scanning
> the longest prefix that could still match, plus looking at just
> one character for each branch-point along the way. (If it uses
> a jump-table at decision points, it should be truly linear.)
Well, I don't know much about the re compiler. It's written in
Python, though, so it shouldn't be *that* hard to check for
yourself...
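For concreteness, here's how a pattern of that shape could be generated mechanically from a trigger list. This is my own illustration (the helper name `build_trie_re` is made up, not code from either of our posts): it builds a nested-dict trie, then emits an alternation in which no two branches at a choice point begin with the same character, putting the empty alternative last so longer triggers win.

```python
import re

def build_trie_re(words):
    # Hypothetical helper, not from the original posts: build a
    # character trie, then emit a prefix-sharing alternation of the
    # same shape as '^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)'.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}  # end-of-word marker

    def emit(node):
        branches = [re.escape(ch) + emit(node[ch])
                    for ch in sorted(node) if ch != '']
        if '' in node:
            branches.append('')  # empty branch last: prefer longer matches
        if len(branches) == 1:
            return branches[0]
        return '(' + '|'.join(branches) + ')'

    return '^' + emit(trie)

pattern = build_trie_re(['Mary', 'John', 'John Smith',
                         'John Smith Sr', 'John Smith Jr', 'Bob'])
rx = re.compile(pattern)
```

Since branches at each choice point start with distinct characters, any backtrack fails immediately, which is exactly the property Bryan describes.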
> > "Dozens to hundreds" sounds like a fairly small number in context,
>
> I agree, though hundreds would make the longest RE string I've
> ever used.
>
> > but: time it! Only you know what fast enough is.
>
> It seems fast. Testing doesn't allay my fears of a bad worst-
> case that an adversary might discover.
> > Here's a cute approach that should have running time independent
> > of the number of triggers:
> [snip]
>
> I've seen that kind of structure called a "trie" (Knuth cites
> the name 'trie' to E. Fredkin, CACM 3; 1960). When I generate
> my RE, I sort of use the same trie structure.
Heh, I'd sort of heard of tries but didn't realize I was implementing
one. I just thought it was a simple-minded DFA engine...
> My (trivial) tests so far indicate that the RE form is faster in
> typical cases. Your trie-searcher is nice in that I have much more
> confidence in the worst-case run-time.
Well, the re engine is written in C, so you would hope that it beats
my Python engine...
The reason I originally wrote code somewhat like what I put in my first
post is that I needed character-at-a-time processing.
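A minimal character-at-a-time trie matcher in that spirit might look like the sketch below. To be clear, this is a reconstruction under assumed names (`make_trie`, `TrieMatcher`), not the code from my first post: it keeps a set of active trie nodes and reports whenever some trigger ends at the character just fed in.

```python
def make_trie(words):
    # Nested-dict trie; the None key marks the end of a complete word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

class TrieMatcher:
    # Character-at-a-time matching: feed() returns True whenever
    # some trigger ends at the character just consumed.
    def __init__(self, trie):
        self.trie = trie
        self.states = []  # trie nodes for matches still in progress

    def feed(self, ch):
        self.states.append(self.trie)  # a new match may start anywhere
        advanced, hit = [], False
        for node in self.states:
            node = node.get(ch)
            if node is not None:
                advanced.append(node)
                if None in node:
                    hit = True
        self.states = advanced
        return hit
```

Per character this does work proportional to the number of live partial matches, so the worst case is bounded by the longest trigger rather than by the number of triggers.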
> > Here's a less cute but probably saner approach:
> [...]
> > this should run in time proportional to the number of different
> > lengths of trigger (and use rather less memory than my first version).
>
> I'm not convinced. The time to hash a string is proportional to
> the length, so I think searches can take time proportional to
> the sum of all the trigger lengths.
I guess, but the string hashing is done in C and is "probably"
insignificant.
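Since the code got snipped above, here's a sketch of the idea (a reconstruction with a made-up name, `find_trigger`, not the snipped original): bucket the triggers by length in a dict of sets, then do one slice-and-lookup per distinct length, longest first so the behavior mirrors the RE's preference for longer matches.

```python
def find_trigger(text, triggers):
    # One set lookup per distinct trigger length; a reconstruction of
    # the approach described above, not the snipped original code.
    by_length = {}
    for t in triggers:
        by_length.setdefault(len(t), set()).add(t)
    # Longest first, so 'John Smith Sr' beats 'John'.
    for length in sorted(by_length, reverse=True):
        if text[:length] in by_length[length]:
            return text[:length]
    return None
```

As Bryan notes, each lookup still has to hash a slice of the given length; the saving grace is that the slicing and hashing happen in C.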
> [I, Bryan asked:]
> >> Am I likely to exceed recursion depth?
> >
> > Not in 2.4 :-) And I doubt it before that, actually.
>
> But Python is only up to ... oh, the smiley. I get it.
Well, my point was there is no recursion limit in Python 2.4's sre
engine. But from my limited understanding, I don't think this type of
regular expression causes any version of sre to recurse.
Cheers,
mwh
--
Need to Know is usually an interesting UK digest of things that
happened last week or might happen next week. [...] This week,
nothing happened, and we don't care.
-- NTK Now, 2000-12-29, http://www.ntk.net/