Problems with re

Tim Peters tim_one at email.msn.com
Sun May 23 03:01:51 EDT 1999


[Berthold Höllmann, with a long regexp that "takes forever" in some cases,
 and is full of surprises anyway]

[Tim]
> + Regular expressions aren't powerful enough to match nested
>   brackets.  So a regexp approach to this problem is at best a
>   sometimes-screws-up hack, no matter how much more time you pour
>   into it.

[Berthold]
> What would be your recommendation instead?

Up to you!  If you can live with errors but can't live with exponential
matching time, keep the regexp as simple as the alternative I posted at the
end.

If you can't live with errors, you're going to have to parse the Python "for
real"; regexps are great for lexical classification but hopeless for real
parsing.   The easiest way to do that is to use a very simple regexp to find
\CallPython sites in the LaTeX, then use Lib/tokenize.py to parse the Python
part.  tokenize.py isn't particularly easy to learn how to use, but it's
bulletproof, and easier to learn than any of the general-purpose parsing
systems.

Looks like you want to find the first unmatched right brace.  So your
tokeneater function can ignore everything except "{" and "}" tokens,
incrementing a depth counter by 1 when it sees the former and decrementing
when it sees the latter.  If the depth counter is already 0 when it sees a
"}", you've found the closing LaTeX brace, or the Python is buggy.  tokenize
will handle strings and continuation lines etc correctly without any work on
your part.
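
A rough sketch of the tokenize approach, under some loudly stated assumptions.
It uses the Python 3 generator interface (tokenize.generate_tokens) rather
than a tokeneater callback, and it assumes call sites look like
\CallPython{...}, which is a guess at the macro syntax.  The names CALL_SITE,
find_closing_brace and call_python_bodies are made up for illustration.  One
deliberate variation: tokenizing starts at the opening LaTeX brace instead of
after it, so the closing brace is the one that brings the depth back to 0.
That keeps the input balanced, which matters because newer tokenizers may
reject an unmatched "}" outright.

```python
import io
import re
import tokenize

# Assumed call-site syntax: \CallPython{ ...python... }
CALL_SITE = re.compile(r"\\CallPython\s*\{")


def find_closing_brace(text, open_pos):
    """Return the index in text of the "}" matching the "{" at open_pos.

    Returns None if the tokenizer gives up first (unterminated or buggy
    Python).
    """
    if text[open_pos] != "{":
        raise ValueError("open_pos must point at an opening brace")
    source = text[open_pos:]
    # Same line splitting the tokenizer will see, for (row, col) -> offset.
    lines = io.StringIO(source).readlines()
    depth = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type != tokenize.OP:
                continue  # braces inside strings never show up as OP tokens
            if tok.string == "{":
                depth += 1
            elif tok.string == "}":
                depth -= 1
                if depth == 0:
                    row, col = tok.start  # row is 1-based
                    before = sum(len(line) for line in lines[:row - 1])
                    return open_pos + before + col
    except (tokenize.TokenError, SyntaxError):
        return None
    return None


def call_python_bodies(latex):
    """Collect the Python text of every \\CallPython{...} site, in order."""
    bodies = []
    pos = 0
    while True:
        match = CALL_SITE.search(latex, pos)
        if match is None:
            break
        open_pos = match.end() - 1  # the "{" that ends the match
        close_pos = find_closing_brace(latex, open_pos)
        if close_pos is None:
            break  # stop at the first site that cannot be parsed
        bodies.append(latex[open_pos + 1:close_pos])
        pos = close_pos + 1
    return bodies
```

The loop stops pulling tokens as soon as the closing brace turns up, so the
LaTeX that follows it is never handed to the tokenizer.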

Or write a character-at-a-time loop yourself that skips over strings and
counts braces.  It's much easier to write something like that than a hairy
regexp.
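
For comparison, a minimal sketch of such a hand-written loop.  The function
name find_unmatched_close is hypothetical, and the scan is handed the text
that follows the opening LaTeX brace.  It skips single- and triple-quoted
strings with backslash escapes, and it makes no attempt to handle "#"
comments, so a brace inside a comment would still be counted.

```python
def find_unmatched_close(src, start=0):
    """Return the index of the first "}" seen at depth 0, or None.

    Scanning begins at src[start]; string literals are skipped whole.
    """
    depth = 0
    i = start
    n = len(src)
    while i < n:
        ch = src[i]
        if ch in "\"'":
            # Decide between a triple-quoted and a plain string.
            quote = src[i:i + 3] if src[i:i + 3] in ('"""', "'''") else ch
            i += len(quote)
            # Advance to the closing quote, stepping over escaped characters.
            while i < n and not src.startswith(quote, i):
                i += 2 if src[i] == "\\" else 1
            i += len(quote)
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return i  # the closing LaTeX brace
            depth -= 1
        i += 1
    return None
```

Called on the text right after the opening brace, the returned index marks
where the Python ends and the LaTeX resumes.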

even-better-it-works-ly y'rs  - tim


PS:  If these are *your* LaTeX constructions you're trying to parse, how
about finessing the problem out of existence by defining trivial-to-find
beginPython/endPython pairs instead?
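
With explicit markers the search really does become trivial.  A sketch,
assuming the pairs are spelled \beginPython and \endPython (the exact macro
names are a guess) and that \endPython never appears inside the Python
itself; BLOCK and python_blocks are illustrative names.

```python
import re

# Non-greedy match between the assumed markers; DOTALL lets "." span lines.
BLOCK = re.compile(r"\\beginPython\b(.*?)\\endPython\b", re.DOTALL)


def python_blocks(latex):
    """Return the raw text found between each marker pair, in order."""
    return [match.group(1) for match in BLOCK.finditer(latex)]
```

No brace counting is needed, since braces inside the Python no longer matter
to the search.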





