Problems with re

Tim Peters tim_one at email.msn.com
Sat May 22 03:53:13 EDT 1999


[Berthold Höllmann]
> I have a regular expression which, in most cases, does what I want it to
> do. But on at least one string it gets into an endless loop (OK, I didn't
> wait forever). See the attached example:
>
> Python 1.5.2 (#2, Apr 22 1999, 14:34:42)  [GCC egcs-2.91.66 19990314
> (egcs-1.1.2  on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> import re
> >>> RE = re.escape
> >>> CP = r'\CallPython'
> >>> loopR = re.compile(
> ...     "(?:" + RE(CP) + r'(\[.*\])?{(?P<CodeC>(?:' + '""".*?"""|".*?"' +
> ...     "|'''.*?'''|'.*?'|" +
> ...     '{.*?}+?|[^{]+?)+?))}',
> ...     re.MULTILINE|re.DOTALL)
> >>>
> >>> LL = loopR.match("\CallPython{LaTeXPy.PyLaTeX(dir(math))}")
> >>> LL = loopR.match("\CallPython{LaTeXPy.PyLaTeX({1:1,2:2,3:3}")
> >>> LL = loopR.match("\CallPython{LaTeXPy.PyLaTeX(dir(math)); LaTeXPy.PyLaTeX({1:1,2:2,3:3}")
>
> I try to parse a LaTeX file for the included "\CallPython" statements to
> extract python commands from this statement.
>
> Do you have any idea?

Oh, several -- but you're not going to like them <wink>.

+ Regular expressions aren't powerful enough to match nested brackets.  So a
regexp approach to this problem is at best a sometimes-screws-up hack, no
matter how much more time you pour into it.
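What handles nesting is plain counting. A minimal sketch, assuming the only
job is to pull out the brace-balanced argument of each \CallPython; the name
extract_call_python and its marker parameter are hypothetical, and braces
inside Python string literals are not treated specially, so this is still a
rough tool rather than a real parser:

```python
def extract_call_python(text, marker="\\CallPython"):
    """Return the brace-balanced argument of each marker{...} in text.

    Hypothetical helper: it counts braces only, so a lone brace inside a
    Python string literal will still throw it off.
    """
    results = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start < 0:
            break
        i = start + len(marker)
        # Skip an optional [...] argument (assumed not to contain "]").
        if i < len(text) and text[i] == "[":
            close = text.find("]", i)
            if close < 0:
                break
            i = close + 1
        if i >= len(text) or text[i] != "{":
            pos = i
            continue
        depth = 0
        body_start = i + 1
        for j in range(i, len(text)):
            ch = text[j]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    # Matching close brace found: keep what lies between.
                    results.append(text[body_start:j])
                    pos = j + 1
                    break
        else:
            # Ran off the end with unbalanced braces: give up.
            break
    return results
```

Called on r"\CallPython{LaTeXPy.PyLaTeX({1:1,2:2,3:3})}" this returns the
whole call including the dictionary, and on input with a missing close brace
it returns an empty list instead of looping.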

+ If you have to use regexps, at least use re.VERBOSE to make the mess more
readable; e.g.,

loopR = re.compile(r"""
(?: \\CallPython
    (\[.*\])?
    {
    (?P<CodeC>
        (?: \""".*?\"""
        |   ".*?"
        |   '''.*?'''
        |   '.*?'
        |   {.*?}+?
        |   [^{]+?
        )+?
    )
)
}
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

This makes modification enormously easier, and makes some obscurities
obvious; e.g., by inspection, the outermost (?: ... ) serves no purpose so
can be removed.

+ You've got nested catch-almost-anything iterators, which can lead to
exponential match time.  That's what your "endless loop" is all about.  See
Friedl's "Mastering Regular Expressions" for details.
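The effect is easy to reproduce on a small scale. The sketch below uses the
textbook pattern (a+)+b as a stand-in (it is not the regexp from the
question, and time_failed_match is a hypothetical helper name). Fed a run of
a's with no b, a backtracking engine tries every way of splitting the run
among the iterations of the outer +, so each extra character should roughly
double the time before the match fails:

```python
import re
import time

# Stand-in pattern: a repeated group whose body is itself a repeat.
nested = re.compile(r"(a+)+b")

def time_failed_match(n):
    """Time nested.match on n 'a's with no 'b' to force full backtracking."""
    text = "a" * n
    start = time.perf_counter()
    result = nested.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed

if __name__ == "__main__":
    # Keep n small: the running time grows very quickly.
    for n in (14, 16, 18, 20):
        result, elapsed = time_failed_match(n)
        print(n, result, "%.4f s" % elapsed)
```

The exact timings depend on the machine and the Python version, so none are
quoted here; the thing to look at is how they grow from one line to the next.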

+ Most of the *pieces* of this regexp don't actually match what you want
them to match, except when you're lucky (which you often will be -- but
sometimes won't be).  See Friedl.

+ You're better off assuming the LaTeX is correct.  What would it really
hurt if your second example matched?  Throw caution to the wind and try this
instead:

loopR = re.compile(r"""
    \\CallPython
    (\[.*?\])?  # note that I made the guts a minimal match here
    {
    (?P<CodeC>
        .*?   # anything
    )
    }\s*$     # until finding a right brace at the end of a line
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

This will screw up too, but the conditions under which it will are now
obvious <wink> and so easy to avoid; won't ever consume exponential time,
either.
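A quick way to see both the behaviour and the failure mode is to run the
pattern above against the strings from the question. This is a minimal
sketch; code_of is a hypothetical helper, and the sample strings are written
as raw strings so the backslash survives:

```python
import re

loopR = re.compile(r"""
    \\CallPython
    (\[.*?\])?  # optional [...] argument, minimal match
    {
    (?P<CodeC>
        .*?   # anything
    )
    }\s*$     # until finding a right brace at the end of a line
""", re.MULTILINE | re.DOTALL | re.VERBOSE)

def code_of(text):
    """Return the CodeC group if text matches at its start, else None."""
    m = loopR.match(text)
    if m is None:
        return None
    return m.group("CodeC")

samples = [
    r"\CallPython{LaTeXPy.PyLaTeX(dir(math))}",
    r"\CallPython{LaTeXPy.PyLaTeX({1:1,2:2,3:3}",
    r"\CallPython{LaTeXPy.PyLaTeX(dir(math)); LaTeXPy.PyLaTeX({1:1,2:2,3:3}",
    # Failure mode: a right brace that ends a line inside the code.
    "\\CallPython{d = {1: 1}\nprint(d)}",
]

if __name__ == "__main__":
    for s in samples:
        print(repr(code_of(s)))
```

The second and third samples match even though their braces are unbalanced,
which is the trade-off being proposed. The last sample stops at the first
right brace that ends a line, so CodeC comes back as 'd = {1: 1'; keeping
such braces off the ends of lines avoids it.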

if-a-regexp-is-longer-than-that-one-it's-wrong<0.9-wink>-ly y'rs  - tim





