reverse engineering Excel spreadsheet

Sun Apr 1 23:30:43 EDT 2007

John Machin wrote:
> On Apr 2, 1:59 am, Duncan Smith <buzz... at urubu.freeserve.co.uk> wrote:
> 
>>Hello,
>>     I am currently implementing (mainly in Python) 'models' that come
>>to me as Excel spreadsheets, with little additional information.  I am
>>expected to use these models in a web application.  Some contain many
>>worksheets and various macros.
>>
>>What I'd like to do is extract the data and business logic so that I can
>>figure out exactly what these models actually do and code it up.  An
>>obvious (I think) idea is to generate an acyclic graph of the cell
>>dependencies so that I can identify which cells contain only data (no
>>parents) and those that depend on other cells.  If I could also extract
>>the relationships (functions), then I could feasibly produce something
>>in pure Python that would mirror the functionality of the original
>>spreadsheet (using e.g. Matplotlib for plots and more reliable RNGs /
>>statistical functions).
> 
> 
> There are two things you have to consider that can create those
> dependencies (1) formulas (2) macros. If it were only formulas, you
> wouldn't need to generate an acyclic graph. A no-parents cell will be
> defined by a type-specific record (e.g. NUMBER). Dependent cells are
> defined by a FORMULA etc record. Similar story with the XML format in
> Office 2007.
> 

Well yes, partitioning the cells according to format would be start.
Now you have me wondering about the new XML format.  Maybe that would be
easier to deal with?

> I'm not aware of any way to use Python to access the content of
> macros. It may be possible using COM ...
> 

I suppose I can do that manually.  Most of them seem to do with
presentation rather than the business logic.

> I'm a bit puzzled by your notion of creating a dependency graph
> *without* first extracting the "relationships (functions)" [which you
> could do only by parsing the formulas and macros].
> 

Incomplete parsing :-).  I thought I might be able to pull out the
dependencies relatively easily, then go back to the spreadsheet to
examine the formulae.

> "pure Python that would mirror the functionality of the original
> spreadsheet": mammoth job; speed?
> 

Basically to try and separate the data, business logic and presentation
for coding up in Python.  Particularly, I don't want to rely on Excel's
statistical functions.  But it's also possible that other similar models
will be making there way to me, and I'd rather have a few simple classes
to handle the business logic, and have model parameters etc. tucked away
in text files.

> Before you do that, have a look at (1) xlrd docs (2) xlrd source (3)
> Openoffice.org docs of XLS format (4) MS docs (now an ECMA standard)
> of Office 2007 XML-based files.
> 

Yes, will do.

> 
>>The final application will be running on a Linux server, but I can use a
>>Windows box (i.e. win32all) for processing the spreadsheets (hopefully
>>not manually).  Any advice on the feasibility of this, and how I might
>>achieve it would be appreciated.
>>
>>I assume there are plenty of people who have a better knowledge of e.g.
>>COM than I do.  I suppose an alternative would be to convert to Open
>>Office and use PyUNO, but I have no experience with PyUNO and am not
>>sure how much more reliable the statistical functions of Open Office
>>are.  At the end of the day, the business logic will not generally be
>>complex, it's extracting it from the spreadsheet that's awkward.  Any
>>advice appreciated.  TIA.  Cheers.
>>
>>Duncan
> 
> 
> PyUNO: Google around a bit. My experience was: tried it, got lost in
> the jungle of its documentation.
> 

Yep.  Me too.  I might have another go at it.  Cheers (to all who replied).

Duncan