[Python-Dev] Tcl 8.1's regexp code (was RE: [Python-Dev] Why Foo is better than Baz)

Wed May 5 07:37:02 CEST 1999

I've consistently found that the best way to kill a thread is to rename it
accurately <wink>.

Agree w/ Guido that few people really care about the differing semantics.

Agree w/ Andrew that it's bad to pull a semantic switcheroo at this stage
anyway:  code will definitely break.  Like

    \b(?:
        (?P<keyword>and|if|else|...) |
        (?P<identifier>[a-zA-Z_]\w*)
    )\b

The (special)|(general) idiom relies on left-to-right match-and-out
searching of alternatives to do its job correctly.  Not to mention that \b
is not a word-boundary assertion in the new pkg (talk about pointlessly
irritating differences!  at least this one could be easily hidden via
brainless preprocessing).

Over the long run, moving to a DFA locks Python out of the directions Perl
is *moving*, namely embedding all sorts of runtime gimmicks in regexps that
exploit knowing the "state of the match so far".  DFAs don't work that way.
I don't mind losing those possibilities, because I think the regexp
sublanguage is strained beyond its limits already.  But that's a decision
with Big Consequences, so deserves some thought.

I'd definitely like the (sometimes dramatically) increased speed a DFA can
offer (btw, this code appears to use a lazily-generated DFA, to avoid the
exponential *compile*-time a straightforward DFA implementation can
suffer -- the code is very complex and lacks any high-level internal docs,
so we better hope Henry stays in love with it <0.5 wink>).

> ...
> My guess is that people don't so much translate long Perl regexp's
> to Python but simply transport their (always incomplete -- Larry Wall
> *wants* it that way :-) knowledge of Perl regexps to Python.

This is directly proportional to the number of feeble CGI programmers Python
attracts <wink>.  The good news is that they wouldn't know an NFA from a DFA
if Larry bit Henry on the ass ...

> My meta-guess is that this is also Henry Spencer's and John
> Ousterhout's guess.

I think Spencer strongly favors DFA semantics regardless of fashion, and
Ousterhout is a pragmatist.  So I trust JO's judgment more <0.9 wink>.

> As for Larry Wall, I guess he really doesn't care :-)

I expect he cares a lot!  Because a DFA would prevent Perl from going even
more insane in its present direction.

About the age of the code, postings to comp.lang.tcl have Henry saying he
was working on the alpha version intensely as recently as Decemeber ('98).
A few complaints about the alpha release trickled in, about regexp compile
speed and regexp matching speed in specific cases.  Perhaps paradoxically,
the latter were about especially simple regexps with long fixed substrings
(where this mountain of sophisticated machinery is likely to get beat cold
by an NFA with some fixed-substring lookahead smarts -- which latter Henry
intended to graft into this pkg too).

[Andrew]
> BTW, there's an interesting reference, I assume to this code, in
> _Mastering Regular Expressions_; Spencer is quoted on page 121 as
> saying it's "at worst quadratic in text size.".

[Guido]
> Not sure if that was the same code -- this is *new* code, not
> Spencer's old code.  I think Friedl's book is older than the current
> code.

I expect this is an invariant, though:  it's not natural for a DFA to know
where subexpression matches begin and end, and there's a pile of xxx_dissect
functions in regexec.c that use what strongly appear to be worst-case
quadratic-time algorithms for figuring that out after it's known that the
overall expression has *a* match.  Expect too, but don't know, that only
pathological cases are actually expensive.

Question:  has this package been released in any other context, or is it
unique to Tcl?  I searched in vain for an announcement (let alone code) from
Henry, or any discussion of this code outside the Tcl world.

whatever-happens-i-vote-we-let-them-debug-it<wink>-ly y'rs  - tim