string parsing / regexp question

Paul McGuire ptmcg at austin.rr.com
Wed Nov 28 14:23:44 EST 2007


On Nov 28, 11:32 am, "Ryan Krauss" <ryanli... at gmail.com> wrote:
> I need to parse the following string:
>
> $$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2
>  }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
>  }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
>  \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$
>
> The first thing I need to do is extract the arguments to \pmatrix{ }
> on both the left and right hand sides of the equal sign, so that the
> first argument is extracted as
>
> {\it x_2}\cr 0\cr 1\cr
>
> and the second is
>
> \left({{{\it m_2}\,s^2
>  }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
>  }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
>  \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr
>
> The trick is that there are extra curly braces inside the \pmatrix{ }
> strings and I don't know how to write a regexp that would count the
> number of open and close curly braces and make sure they match, so
> that it can find the correct ending curly brace.
>
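
For reference, the brace counting described above can be done without a regexp by scanning the string and tracking nesting depth. This is only a minimal sketch: the function name pmatrix_args is made up for illustration, and it assumes the input contains no escaped braces such as \{ or \}.

```python
def pmatrix_args(text, command="\\pmatrix"):
    """Return the text inside the braces that follow each `command`."""
    args = []
    pos = text.find(command)
    while pos != -1:
        # Assumes the next "{" after the command opens its argument.
        start = text.find("{", pos + len(command))
        if start == -1:
            break
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    # Depth is back to zero: this is the matching close brace.
                    args.append(text[start + 1:i])
                    pos = text.find(command, i + 1)
                    break
        else:
            # The scan ran off the end without closing the argument.
            raise ValueError("unbalanced braces after %s" % command)
    return args


example = r"\pmatrix{{\it x_2}\cr 0\cr }=\pmatrix{{{F}\over{k}}\cr }"
for arg in pmatrix_args(example):
    print(arg)
```

This returns the raw argument text only; it does not give you any structure for the nested groups inside each argument.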

As Tim Grove points out, writing a grammar for this expression is
fairly simple, especially with the latest version of pyparsing, which
includes a new helper function, nestedExpr.  Here is the whole program
to parse your example:

    from pyparsing import *

    data = r"""$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=
    \pmatrix{\left({{{\it m_2}\,s^2
     }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
     }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
     \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$"""

    # nestedExpr matches a brace-delimited group, including nested groups
    PMATRIX = Literal(r"\pmatrix")
    nestedBraces = nestedExpr("{","}")
    grammar = "$$" + PMATRIX + nestedBraces + "=" + \
                     PMATRIX + nestedBraces + \
                     "$$"
    res = grammar.parseString(data)
    print(res)

This prints the following:

['$$', '\\pmatrix', [['\\it', 'x_2'], '\\cr', '0\\cr', '1\\cr'], '=',
'\\pmatrix', ['\\left(', [[['\\it', 'm_2'], '\\,s^2'], '\\over',
['k']], '+1\\right)\\,', ['\\it', 'x_1'], '-', [['F'], '\\over',
['k']], '\\cr', '-', [[['\\it', 'm_2'], '\\,s^2\\,F'], '\\over',
['k']], '-F+\\left(', ['\\it', 'm_2'], '\\,s^2\\,\\left(', [[['\\it',
'm_2'], '\\,s^2'], '\\over', ['k']], '+1', '\\right)+', ['\\it',
'm_2'], '\\,s^2\\right)\\,', ['\\it', 'x_1'], '\\cr', '1\\cr'], '$$']

Okay, maybe this looks a bit messy.  But believe it or not, the
returned results give you access to each grammar element as:

['$$', '\\pmatrix', [nested arg list], '=', '\\pmatrix',
[nested arg list], '$$']

Not only has the parser handled the {} nesting levels, but it has
structured the returned tokens according to that nesting.  (The '{}'s
are gone now, since their delimiting function has been replaced by the
nesting hierarchy in the results.)
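
To make the mapping from braces to nested lists concrete, here is a rough plain-Python sketch of the idea. It is not how pyparsing implements nestedExpr; the function name nest is made up for illustration, and it simply splits tokens on whitespace and braces.

```python
def nest(text, opener="{", closer="}"):
    """Split text on whitespace and braces, returning nested lists."""
    root = []
    stack = [root]          # stack[-1] is the list currently being filled
    word = ""
    for ch in text:
        if ch == opener or ch == closer or ch.isspace():
            if word:
                # A delimiter ends the token collected so far.
                stack[-1].append(word)
                word = ""
            if ch == opener:
                # Open a new sublist one nesting level deeper.
                child = []
                stack[-1].append(child)
                stack.append(child)
            elif ch == closer:
                if len(stack) == 1:
                    raise ValueError("unmatched closing brace")
                stack.pop()
        else:
            word += ch
    if word:
        stack[-1].append(word)
    if len(stack) != 1:
        raise ValueError("unmatched opening brace")
    return root


print(nest(r"{{\it x_2}\cr 0\cr 1\cr }"))
```

Each opening brace starts a new sublist and each closing brace returns to the enclosing one, so the braces themselves never need to appear in the output.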

You could use tuple assignment to get at the individual fields:
    dummy,dummy,lhs_args,dummy,dummy,rhs_args,dummy = res

Or you could access the fields in res using list indexing:
    lhs_args, rhs_args = res[2],res[5]

But both of these methods will break if you decide to extend the
grammar with additional or optional fields.

A safer approach is to give the grammar elements results names, as in
this slightly modified version of grammar:

    grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
                     PMATRIX + nestedBraces("rhs_args") + \
                     "$$"

Now you can access the parsed fields as if the results were a dict
with keys "lhs_args" and "rhs_args", or as an object with attributes
named "lhs_args" and "rhs_args":

    res = grammar.parseString(data)
    print(res["lhs_args"])
    print(res["rhs_args"])
    print(res.lhs_args)
    print(res.rhs_args)

Note that the default behavior of nestedExpr is to give back a nested
list of the elements according to how the original text was nested
within braces.
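
If you are working with the nested list form, brace-delimited text can be rebuilt from it afterwards. The sketch below assumes plain Python lists (for example, the result of calling asList() on the parsed results); the function name to_text is made up for illustration. Tokens are joined with single spaces, so the spacing will differ from the source text.

```python
def to_text(tokens, opener="{", closer="}"):
    """Rebuild brace-delimited text from a nested list of string tokens."""
    parts = []
    for tok in tokens:
        if isinstance(tok, list):
            # A sublist stands for one brace-delimited group.
            parts.append(opener + to_text(tok, opener, closer) + closer)
        else:
            parts.append(tok)
    # Original whitespace is not recoverable, so join with single spaces.
    return " ".join(parts)


print(to_text([["\\it", "x_2"], "\\cr", "0\\cr", "1\\cr"]))
```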

If you just want the original text, add a parse action to nestedBraces
to do this for you (keepOriginalText is another pyparsing builtin).
The parse action is executed at parse time so that there is no post-
processing needed after the parsed results are returned:

    nestedBraces.setParseAction(keepOriginalText)
    grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
                     PMATRIX + nestedBraces("rhs_args") + \
                     "$$"

    res = grammar.parseString(data)
    print(res)
    print(res.lhs_args)
    print(res.rhs_args)

Now this program returns the original text for the nested brace
expressions:

['$$', '\\pmatrix', '{{\\it x_2}\\cr 0\\cr 1\\cr }', '=', '\\pmatrix',
'{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
\\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }', '$$']
['{{\\it x_2}\\cr 0\\cr 1\\cr }']
['{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
\\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }']

You can find more info on pyparsing at http://pyparsing.wikispaces.com.

Cheers!
-- Paul


