parsing an Excel formula with the re module

Wed Jan 6 19:53:54 EST 2010

On Jan 6, 6:54 am, vsoler <vicente.so... at gmail.com> wrote:
> On 5 Jan, 20:21, vsoler <vicente.so... at gmail.com> wrote:
>
>
>
> > On 5 Jan, 20:05, Mensanator <mensana... at aol.com> wrote:
>
> > > On Jan 5, 12:35 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>
> > > > vsoler wrote:
> > > > > Hello,
>
> > > > > I am accessing an Excel file by means of Win32 COM technology.
> > > > > For a given cell, I am able to read its formula. I want to make a map
> > > > > of how cells reference one another, how different sheets reference one
> > > > > another, how workbooks reference one another, etc.
>
> > > > > Hence, I need to parse Excel formulas. Can I do it by means only of re
> > > > > (regular expressions)?
>
> > > > > I know that for simple formulas such as "=3*A7+5" it is indeed
> > > > > possible. What about complex formulas that include functions,
> > > > > sheet names and possibly other *.xls files?
>
> > > > > For example    "=Book1!A5+8" should be parsed into ["=","Book1", "!",
> > > > > "A5","+","8"]
>
> > > > > Can anybody help? Any suggestions?
>
> > > > Do you mean "how" or do you really mean "whether", i.e., get a list of
> > > > the other cells that are referred to by a certain cell? For example,
> > > > "=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5"].
>
> > > Ok, although "Book1" would be the default name of a workbook, with
> > > default worksheets labeled "Sheet1", "Sheet2", etc.
>
> > > If I had a worksheet named "Sheety" that wanted to reference a cell on
> > > "Sheetx" OF THE SAME WORKBOOK, it would be =Sheetx!A7. If the reference
> > > was to a completely different workbook (say Book1 with worksheets
> > > labeled "Sheet1", "Sheet2") then the cell might have =[Book1]Sheet1!A7.
>
> > > And don't forget the $'s! You may see =[Book1]Sheet1!$A$7.
>
> > Yes, Mensanator, but...  what re should I use? I'm looking for the re
> > statement. No doubt you can help!
>
> > Thank you.
>
> Let me give you an example:
>
> >>> import re
> >>> re.split("([^0-9])", "123+456*/")
>
> ['123', '+', '456', '*', '', '/', '']
>
> I find it excellent that one single statement is able to do a lexical
> analysis of an expression!

That is NOT lexical analysis.
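
For instance, here is the same split applied to the "=3*A7+5" formula
from earlier in the thread (a small sketch, minus the leading "="). The
cell reference A7 is torn into separate pieces and an empty string shows
up in the result, so no usable tokens come out of it:

```python
import re

# Same pattern as in the quoted example, applied to a formula that
# contains a cell reference instead of only digits and operators.
pieces = re.split("([^0-9])", "3*A7+5")
print(pieces)  # ['3', '*', '', 'A', '7', '+', '5']
```
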
>
> If the expression contains variables, such as A12 or B9, I can try
> another re expression. Which one should I use?
>
> And if my expression contains parentheses?   And the sin() function?

You need a proper lexical analysis, followed by a parser. What you
are trying to do can NOT be accomplished in any generality with a
single regex. The Excel formula syntax has several tricky bits. E.g.
IIRC whether TAX09 is a (macro) name or a cell reference depends on
what version of Excel you are targeting, but if it appears like
TAX09!A1:B2 then it's a sheet name.
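
To give a rough idea of what the lexing step alone might look like, here
is a minimal sketch of a regex-driven tokenizer for a small subset of
the syntax. The token names, the TOKEN_SPEC table and the tokenize()
function are assumptions made up for this illustration; this is not
xlwt's lexer. Quoted sheet names, string literals, defined names and the
version-dependent rules mentioned above are deliberately not handled,
and a parser would still have to run over the resulting token stream.

```python
import re

# Hypothetical token table for a SMALL subset of Excel formula syntax.
# Order matters: the more specific patterns are tried first.
TOKEN_SPEC = [
    ("BOOK",   r"\[[^\]]+\]"),                     # [Book1]
    ("SHEET",  r"[A-Za-z_][A-Za-z0-9_.]*!"),       # Sheet1! (unquoted names only)
    ("FUNC",   r"[A-Za-z_][A-Za-z0-9_.]*(?=\()"),  # SIN, but only before "("
    ("CELL",   r"\$?[A-Za-z]{1,3}\$?[0-9]+"),      # A7, $A$7
    ("NUMBER", r"[0-9]+(?:\.[0-9]*)?"),            # 8, 3.14
    ("OP",     r"[-+*/^=<>&:,]"),                  # single-character operators
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SPACE",  r"\s+"),                            # matched, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(formula):
    """Return a list of (kind, text) pairs, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(formula):
        m = MASTER.match(formula, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {formula[pos]!r} at position {pos}")
        if m.lastgroup != "SPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for kind, text in tokenize("=[Book1]Sheet1!$A$7+SIN(B9)*8"):
    print(kind, text)
```

Note that this sketch always classifies a bare TAX09 as a CELL token,
which is exactly the kind of decision that cannot be made from the text
alone, while TAX09!A1:B2 comes out as a SHEET token followed by a range.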

The xlwt package (of which I am the maintainer) has a lexer and parser
for a largish subset of the syntax ... see  http://pypi.python.org/pypi/xlwt