split hangs

Tim Peters tim.one at home.com
Thu Dec 28 05:15:29 EST 2000


[Jacek Popławski]
> ourregex=re.compile(r'(<(?:[^<>"]+(?:"[^"]*")?)+>)')

Nesting + and * quantifiers is likely to yield a regexp that takes time
exponential in the length of the string, especially when the overall
regexp doesn't match (the engine tries every possible nested way of
carving up the string, but none works in the end).  Here's a small
variation of Harald Kirsch's program:

import re, sys, time

boing = re.compile(r'(<(?:[^<>"]+(?:"[^"]*")?)+>)')

s = 'p < 0.05, all data in arbitrary'

for i in range(1, len(s)):
    t = s[:i]
    start = time.clock()
    boing.search(t)
    finish = time.clock()
    elapsed = finish - start
    print "%.3f seconds for '%s'" % (elapsed, t)

And here's as much as it printed before I got bored waiting for it:

0.000 seconds for 'p'
0.000 seconds for 'p '
0.000 seconds for 'p <'
0.000 seconds for 'p < '
0.000 seconds for 'p < 0'
0.000 seconds for 'p < 0.'
0.000 seconds for 'p < 0.0'
0.000 seconds for 'p < 0.05'
0.000 seconds for 'p < 0.05,'
0.000 seconds for 'p < 0.05, '
0.000 seconds for 'p < 0.05, a'
0.000 seconds for 'p < 0.05, al'
0.001 seconds for 'p < 0.05, all'
0.003 seconds for 'p < 0.05, all '
0.003 seconds for 'p < 0.05, all d'
0.007 seconds for 'p < 0.05, all da'
0.014 seconds for 'p < 0.05, all dat'
0.028 seconds for 'p < 0.05, all data'
0.086 seconds for 'p < 0.05, all data '
0.172 seconds for 'p < 0.05, all data i'
0.406 seconds for 'p < 0.05, all data in'
0.581 seconds for 'p < 0.05, all data in '
0.854 seconds for 'p < 0.05, all data in a'
1.708 seconds for 'p < 0.05, all data in ar'
3.412 seconds for 'p < 0.05, all data in arb'
6.836 seconds for 'p < 0.05, all data in arbi'

The tail end of the output is clear:  adding another character to the string
doubles the time it takes to figure out that the regexp doesn't match.  This
isn't a bug, it's the way Python (and Perl, and Emacs, and ...) regexps
work.  A full explanation is quite involved; you can find one in Jeffrey
Friedl's excellent book "Mastering Regular Expressions" (O'Reilly).
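The blow-up is easy to reproduce with an even smaller pattern.  Here's a
sketch (modern Python 3, so time.perf_counter instead of time.clock; not
part of the program above) showing (a+)+b failing against a run of a's:

```python
import re
import time

# (a+)+ nests quantifiers: the engine can carve a run of n a's into
# groups in exponentially many ways, and on failure it tries them all.
pat = re.compile(r'(a+)+b')

for n in (16, 18, 20):
    s = 'a' * n + 'c'   # ends in 'c', so the pattern can never match
    start = time.perf_counter()
    assert pat.search(s) is None
    print('%d chars: %.3f seconds' % (n + 1, time.perf_counter() - start))
```

Each two extra characters roughly quadruples the time, the same doubling-
per-character curve as in the output above.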

For Harald's particular flavor of non-matching test data, the regexp can be
made very fast by adding a positive lookahead assertion, to make sure
there's *some* hope of eventually matching the ">":

boing = re.compile(r'(<(?=[^<>]*>)(?:[^<>"]+(?:"[^"]*")?)+>)')
                       ^^^^^^^^^^^ new lookahead inserted
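On Harald's string the lookahead (of the form (?=[^<>]*>), checking that
a ">" appears before any other angle bracket) fails right after the "<",
so the engine never enters the nested quantifiers at all.  A quick check
(Python 3 shown, hence time.perf_counter):

```python
import re
import time

# Lookahead variant: bail out immediately unless a '>' lies ahead.
boing = re.compile(r'(<(?=[^<>]*>)(?:[^<>"]+(?:"[^"]*")?)+>)')

s = 'p < 0.05, all data in arbitrary'
start = time.perf_counter()
m = boing.search(s)   # no '>' anywhere, so this fails in a linear scan
print('%.6f seconds, match: %r' % (time.perf_counter() - start, m))
```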

A more robust way is to use what Friedl calls "unrolling the loop":

boing = re.compile(r'(<[^<>"]+(?:"[^"]*"[^<>"]*)*>)')

I'm afraid that understanding why that pays off requires a non-trivial grasp
of how the matching engine works; see Friedl for a full explanation.
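The payoff is easy to measure, though: the unrolled form gives each
character exactly one way to be matched, so there's nothing to backtrack
over.  A check against the same data (again Python 3 here, not the
original Python 2 timing harness):

```python
import re
import time

# Friedl-style unrolled pattern: normal* (special normal*)* closing
boing = re.compile(r'(<[^<>"]+(?:"[^"]*"[^<>"]*)*>)')

s = 'p < 0.05, all data in arbitrary'
start = time.perf_counter()
m = boing.search(s)   # fails quickly, no exponential blow-up
print('%.6f seconds, match: %r' % (time.perf_counter() - start, m))

# and it still matches ordinary tags, quoted attributes and all:
print(boing.search('see <a href="index.html">here</a>').group(1))
```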

or-use-a-real-html-parser-ly y'rs  - tim
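(For what it's worth, the real-parser route might look like this -- a
sketch using html.parser from today's standard library, not from the
original post; the 2000-era equivalents were sgmllib and htmllib:)

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect start tags instead of regex-scanning for them."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

p = TagCollector()
p.feed('p < 0.05, all data in <a href="index.html">arbitrary</a> units')
print(p.tags)   # the stray '< ' is plain text to a real parser
```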

More information about the Python-list mailing list