string parsing / regexp question

Ryan Krauss ryanlists at gmail.com
Wed Nov 28 18:12:51 EST 2007


Interesting.  Thanks Paul and Tim.  This looks very promising.

Ryan

On Nov 28, 2007 1:23 PM, Paul McGuire <ptmcg at austin.rr.com> wrote:
> On Nov 28, 11:32 am, "Ryan Krauss" <ryanli... at gmail.com> wrote:
> > I need to parse the following string:
> >
> > $$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2
> >  }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
> >  }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
> >  \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$
> >
> > The first thing I need to do is extract the arguments to \pmatrix{ }
> > on both the left and right hand sides of the equal sign, so that the
> > first argument is extracted as
> >
> > {\it x_2}\cr 0\cr 1\cr
> >
> > and the second is
> >
> > \left({{{\it m_2}\,s^2
> >  }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
> >  }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
> >  \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr
> >
> > The trick is that there are extra curly braces inside the \pmatrix{ }
> > strings and I don't know how to write a regexp that would count the
> > number of open and close curly braces and make sure they match, so
> > that it can find the correct ending curly brace.
> >
>
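
On the brace-counting part of the question: Python's re module has no construct for matching arbitrarily nested braces, but a short loop that tracks the nesting depth can find the matching close brace. The following is only a minimal sketch in plain Python, not code from this thread. The function name brace_args and the shortened sample string are made up for illustration, and backslash-escaped braces are assumed not to occur.

```python
def brace_args(text, command=r"\pmatrix"):
    """Return the brace-delimited argument following each `command`.

    Walks the text once, tracking brace depth so that nested groups
    stay intact. Backslash-escaped braces are not handled.
    """
    args = []
    opener = command + "{"
    pos = text.find(opener)
    while pos != -1:
        start = pos + len(opener)  # first character after the opening brace
        depth = 1
        i = start
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError("unbalanced braces after " + command)
        args.append(text[start:i - 1])  # exclude the closing brace
        pos = text.find(opener, i)
    return args


# Shortened, hypothetical version of the string in the question.
sample = r"$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{{{F}\over{k}}\cr 1\cr }$$"
lhs, rhs = brace_args(sample)
print(lhs)
print(rhs)
```

This returns the raw text between the outer braces and nothing more. It does not give the nested structure that a real grammar provides.
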
> As Tim Grove points out, writing a grammar for this expression is
> really pretty simple, especially using the latest version of
> pyparsing, which includes a new helper method, nestedExpr.  Here is
> the whole program to parse your example:
>
>     from pyparsing import *
>
>     # raw string, so the LaTeX backslashes are kept as-is
>     data = r"""$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=
> \pmatrix{\left({{{\it m_2}\,s^2
>      }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
>      }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
>      \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$"""
>
>     PMATRIX = Literal(r"\pmatrix")
>     # matches a {...} group, including any {...} groups nested inside it
>     nestedBraces = nestedExpr("{","}")
>     grammar = "$$" + PMATRIX + nestedBraces + "=" + \
>                      PMATRIX + nestedBraces + \
>                      "$$"
>     res = grammar.parseString(data)
>     print(res)
>
> This prints the following:
>
> ['$$', '\\pmatrix', [['\\it', 'x_2'], '\\cr', '0\\cr', '1\\cr'], '=',
> '\\pmatrix', ['\\left(', [[['\\it', 'm_2'], '\\,s^2'], '\\over',
> ['k']], '+1\\right)\\,', ['\\it', 'x_1'], '-', [['F'], '\\over',
> ['k']], '\\cr', '-', [[['\\it', 'm_2'], '\\,s^2\\,F'], '\\over',
> ['k']], '-F+\\left(', ['\\it', 'm_2'], '\\,s^2\\,\\left(', [[['\\it',
> 'm_2'], '\\,s^2'], '\\over', ['k']], '+1', '\\right)+', ['\\it',
> 'm_2'], '\\,s^2\\right)\\,', ['\\it', 'x_1'], '\\cr', '1\\cr'], '$$']
>
> Okay, maybe this looks a bit messy.  But the returned results follow
> the structure of the grammar, with one entry per grammar element:
>
> ['$$', '\\pmatrix', [nested arg list], '=', '\\pmatrix',
> [nested arg list], '$$']
>
> Not only has the parser handled the {} nesting levels, but it has
> structured the returned tokens according to that nesting.  (The '{}'s
> are gone now, since their delimiting function has been replaced by the
> nesting hierarchy in the results.)
>
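
To make the point about nesting concrete, here is a small sketch that walks such a nested token list. It is plain Python and does not call pyparsing. It assumes the results have already been converted to ordinary lists (for example with the asList() method of the results object), and the helper names to_braced and max_depth are made up for illustration.

```python
def to_braced(tokens):
    """Rebuild brace-delimited text from a nested token list.

    Each nested list stands for one {...} group; strings are leaf tokens.
    The original whitespace is not recoverable, so tokens are joined
    with single spaces.
    """
    parts = []
    for tok in tokens:
        if isinstance(tok, list):
            parts.append("{" + to_braced(tok) + "}")
        else:
            parts.append(tok)
    return " ".join(parts)


def max_depth(tokens):
    """Return how many levels of nested lists occur inside `tokens`."""
    return max((1 + max_depth(t) for t in tokens if isinstance(t, list)),
               default=0)


# Left-hand-side argument list, copied from the printed results above.
lhs_tokens = [['\\it', 'x_2'], '\\cr', '0\\cr', '1\\cr']
print(to_braced(lhs_tokens))
print(max_depth(lhs_tokens))
```

Because the braces are implied by the list structure, putting them back is a simple recursion. The exact spacing of the original text is lost, though.
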
> You could use tuple assignment to get at the individual fields:
>     dummy,dummy,lhs_args,dummy,dummy,rhs_args,dummy = res
>
> Or you could access the fields in res using list indexing:
>     lhs_args, rhs_args = res[2],res[5]
>
> But both of these methods will break if you decide to extend the
> grammar with additional or optional fields.
>
> A safer approach is to give the grammar elements results names, as in
> this slightly modified version of grammar:
>
>     grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
>                      PMATRIX + nestedBraces("rhs_args") + \
>                      "$$"
>
> Now you can access the parsed fields as if the results were a dict
> with keys "lhs_args" and "rhs_args", or as an object with attributes
> named "lhs_args" and "rhs_args":
>
>     res = grammar.parseString(data)
>     print(res["lhs_args"])   # dict-style access
>     print(res["rhs_args"])
>     print(res.lhs_args)      # attribute-style access
>     print(res.rhs_args)
>
> Note that the default behavior of nestedExpr is to give back a nested
> list of the elements according to how the original text was nested
> within braces.
>
> If you just want the original text, add a parse action to nestedBraces
> to do this for you (keepOriginalText is another pyparsing builtin).
> The parse action is executed at parse time, so no post-processing is
> needed after the parsed results are returned:
>
>     nestedBraces.setParseAction(keepOriginalText)
>     grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
>                      PMATRIX + nestedBraces("rhs_args") + \
>                      "$$"
>
>     res = grammar.parseString(data)
>     print(res)
>     print(res.lhs_args)
>     print(res.rhs_args)
>
> Now this program returns the original text for the nested brace
> expressions:
>
> ['$$', '\\pmatrix', '{{\\it x_2}\\cr 0\\cr 1\\cr }', '=', '\\pmatrix',
> '{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
> \\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
> m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
> m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }', '$$']
> ['{{\\it x_2}\\cr 0\\cr 1\\cr }']
> ['{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
> \\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
> m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
> m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }']
>
> You can find more info on pyparsing at http://pyparsing.wikispaces.com.
>
> Cheers!
> -- Paul
> --
> http://mail.python.org/mailman/listinfo/python-list
>


