reverse engineering Excel spreadsheet

Sun Apr 1 22:37:06 EDT 2007

On Apr 2, 1:59 am, Duncan Smith <buzz... at urubu.freeserve.co.uk> wrote:
> Hello,
>      I am currently implementing (mainly in Python) 'models' that come
> to me as Excel spreadsheets, with little additional information.  I am
> expected to use these models in a web application.  Some contain many
> worksheets and various macros.
>
> What I'd like to do is extract the data and business logic so that I can
> figure out exactly what these models actually do and code it up.  An
> obvious (I think) idea is to generate an acyclic graph of the cell
> dependencies so that I can identify which cells contain only data (no
> parents) and those that depend on other cells.  If I could also extract
> the relationships (functions), then I could feasibly produce something
> in pure Python that would mirror the functionality of the original
> spreadsheet (using e.g. Matplotlib for plots and more reliable RNGs /
> statistical functions).

There are two things you have to consider that can create those
dependencies (1) formulas (2) macros. If it were only formulas, you
wouldn't need to generate an acyclic graph. A no-parents cell will be
defined by a type-specific record (e.g. NUMBER). Dependent cells are
defined by a FORMULA etc record. Similar story with the XML format in
Office 2007.

I'm not aware of any way to use Python to access the content of
macros. It may be possible using COM ...

I'm a bit puzzled by your notion of creating a dependency graph
*without* first extracting the "relationships (functions)" [which you
could do only by parsing the formulas and macros].

"pure Python that would mirror the functionality of the original
spreadsheet": mammoth job; speed?

Before you do that, have a look at (1) xlrd docs (2) xlrd source (3)
Openoffice.org docs of XLS format (4) MS docs (now an ECMA standard)
of Office 2007 XML-based files.

>
> The final application will be running on a Linux server, but I can use a
> Windows box (i.e. win32all) for processing the spreadsheets (hopefully
> not manually).  Any advice on the feasibility of this, and how I might
> achieve it would be appreciated.
>
> I assume there are plenty of people who have a better knowledge of e.g.
> COM than I do.  I suppose an alternative would be to convert to Open
> Office and use PyUNO, but I have no experience with PyUNO and am not
> sure how much more reliable the statistical functions of Open Office
> are.  At the end of the day, the business logic will not generally be
> complex, it's extracting it from the spreadsheet that's awkward.  Any
> advice appreciated.  TIA.  Cheers.
>
> Duncan

PyUNO: Google around a bit. My experience was: tried it, got lost in
the jungle of its documentation.

HTH,
John