Python RegExp

Tue Mar 22 18:41:33 EST 2005

davidchipping at gmail.com wrote:

>I have a string which I wish to match using RE, however when I run my
> comparison (using the match method) on my machine it never returns,
> using the CPU fully.
>
> The code is (very simply):
>
> ------------------------------
> import re
>
> buffer = r"#1 1 550 111 SYNC_PEER RES
> <YP>syncpeers=(id=54325432;add=10." \
>
> "0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port=543;up=543)," \
>            "(id=54325432;add=10.0.0.1;port=89;up=8"
>
> p = re.compile(r'#(?P<tran>[0-9]+\s)(?P<sess>[0-9]+\s)' \
>                + '(?P<ndto>[0-9]+\s)(?P<ndfr>[0-9]+\s)' \
>                + '(?P<cmd>[a-zA-Z_]+\s)(?P<dir>[a-zA-Z_]+)' \
>                + '(?P<parm>(\s<[a-zA-Z]+>(\s*[a-zA-Z]+=[a-zA' \
>                + '-Z0-9.,=();]+\s*)+<[a-zA-Z]+>)*)#(?P<left>.*$)')
>
> print "starting the match"
> rst = (p.match (buffer) != None)
> print "finishing the match"
> --------------------------------
>
> Should this every actually happen? Regardless of whether the statment
> matches or not, surely it should do this?

surely the following should never be slow?

    n = len(s)
    for a in range(n):
        for b in range(n):
            for c in range(n):
                for d in range(n):
                    for e in range(n):
                            ... etc ...

(your RE contains several nested and/or related repeat operators,
which forces the poor engine to test zillions of combinations before
giving up. see Friedl's RE book for more information on this)

I suggest using a real parser for this task (or using the RE engine as
a tokenizer, and doing the rest in code; see
http://effbot.org/zone/xml-scanner.htm for more on this)

</F>