[Tutor] Splitting strings into blocks

Mon May 1 14:57:27 CEST 2006

Daniel Watkins wrote:
> Hi list,
> I'm currently working on a program to parse LaTeX style maths expressions and 
> provide an answer. For example, I have the expression "2^\frac{1}{2}". I'm 
> trying to work out a way to split this into its most basic blocks of LaTeX 
> (i.e. 2^ and \frac{1}{2}) while maintaining a record of the depth of the 
> expression (i.e. (2^,0),(\frac{1}{2},1)). I will then process this list from 
> the highest order downwards, feeding the deeper results progressively into 
> shallower elements until all have been calculated.
> LaTeX allows me to legally express the previously stated expression as 
> "{2^{\frac{1}{2}}}". This makes it much easier to figure out where the units 
> of LaTeX are located. The depth of any item can now be expressed as the 
> number of unpaired opening or closing braces between the element and the 
> start or end of the expression.
> I'm essentially looking for a way to split the string up along the braces, 
> while recording the number of braces between the split and either end of the 
> expression.

First, I'll echo Danny's question - why do you need to do this?

For a general parser of LaTeX expressions you will want to use a parsing 
package. I have found pyparsing to be pretty easy to use but there are 
many others. Someone may have solved this problem already.

To answer your specific question, here is code that uses re.split() to 
break an expression on the braces. A simple loop through the results 
then keeps track of the nesting level and prints the depth of each token 
between the braces:

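import re

# Raw string, so the single backslash in \frac is kept literally.
data = r"{2^{\frac{1}{2}}}"

depth = 0  # current brace nesting level

# The capturing group makes re.split() return the braces as tokens too.
for token in re.split(r'([{}])', data):
    if token == '{':
        depth += 1
    elif token == '}':
        depth -= 1
    elif token:  # skip the empty strings produced next to a brace
        print(depth, token)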
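
If you would rather have the (depth, token) pairs in a list than printed, 
the same loop can be wrapped in a function. This is only a sketch: the name 
split_by_depth is made up for illustration, and the check for unbalanced 
braces is an extra that the question did not ask for. Note that depth here 
counts from 1 inside the outermost brace, and that \frac and its two 
arguments come out as separate tokens rather than as one unit.

```python
import re

def split_by_depth(expr):
    """Return a list of (depth, token) pairs for the text between braces.

    Hypothetical helper: depth is the number of braces open at the token.
    """
    pairs = []
    depth = 0
    # The capturing group keeps the braces themselves in the split result.
    for token in re.split(r'([{}])', expr):
        if token == '{':
            depth += 1
        elif token == '}':
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched closing brace")
        elif token:  # ignore empty strings produced next to a brace
            pairs.append((depth, token))
    if depth != 0:
        raise ValueError("unmatched opening brace")
    return pairs

pairs = split_by_depth(r"{2^{\frac{1}{2}}}")

# Deepest tokens first, since the plan is to evaluate from the inside out.
# sorted() is stable, so tokens at the same depth keep their original order.
deepest_first = sorted(pairs, key=lambda pair: pair[0], reverse=True)
print(deepest_first)
```

Grouping \frac with its arguments would still need a further pass over the 
list, or a real parser, so treat this as a starting point only.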

Kent