Parsing Excel spreadsheets

John Machin sjmachin at lexicon.net
Fri Jan 2 19:04:32 EST 2009


On Jan 3, 2:01 am, brooklineTom <Brookline... at gmail.com> wrote:
> On Dec 31 2008, 9:56 am, John Machin <sjmac... at lexicon.net> wrote:

> > On Dec 31 2008, 4:02 pm, brooklineTom <Brookline... at gmail.com> wrote:
>
> > > andyh... at gmail.com wrote:
> > > > Hi,
>
> > > > Can anybody recommend an approach for loading and parsing Excel
> > > > spreadsheets in Python. Any well known/recommended libraries for this?
>
> > > > The only thing I found in a brief search was http://www.lexicon.net/sjmachin/xlrd.htm,
> > > > but I'd rather get some more input before going with something I don't
> > > > know.
>
> > > > Thanks,
> > > > Andy.
>
> > > I save the spreadsheets (in Excel) in xml format.
>
> > Which means that you need to be on a Windows box with a licensed copy
> > of Excel. I presume you're talking about using Excel 2003 and saving as
> > "XML Spreadsheet (*.xml)". Do you save the files manually, or using a
> > COM script? What is the largest xls file that you've saved as xml, how
> > big was the xml file, and how long did it take to parse the xml file?
> > Do you extract formatting information or just cell contents?
>
> 1. The OP requested Excel files, by construction those must be
> generated with a licensed copy of Excel. I did the actual processing
> on both linux and windoze platforms.

My point was that however the original XLS files were created or
acquired, the first step in your solution involves converting the XLS
file to "XML Spreadsheet" format, which requires a copy of Excel on a
Windows box. Many people start with an XLS file but have no Excel, no
Windows box, no COM, and users who can't be relied on to open a file
and save it in the right format with the right name and extension.

BTW, did you consider opening the XLS files with OpenOffice.org's Calc
and saving them in its default ods format (chunks of XML in a zip
file)?
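
Something along these lines would get at the cell text in an ods file
(untested sketch, standard library only; it assumes the usual
content.xml layout inside the zip, ignores the repeated-column
shorthand, and the file name is made up):

import zipfile
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def iter_ods_rows(path):
    # Yield each table-row in content.xml as a list of cell strings.
    zf = zipfile.ZipFile(path)
    try:
        tree = ET.parse(zf.open("content.xml"))
    finally:
        zf.close()
    for row in tree.iter("{%s}table-row" % TABLE_NS):
        cells = []
        for cell in row.findall("{%s}table-cell" % TABLE_NS):
            # The cell's text lives in one or more text:p children.
            text = "".join(p.text or "" for p in cell.findall("{%s}p" % TEXT_NS))
            cells.append(text)
        yield cells

for row in iter_ods_rows("hotels.ods"):    # made-up file name
    print(row)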


> 3. The largest file I used was about 228M,

Is that the XLS file or the XML file?

> containing 36,393 hotel
> properties from Commission Junction. Each entry had 113 cells. The
> parsing overhead was minimal (on a per-entry basis) -- that's why I
> chose to use a pull-parser.
> 4. I extracted primarily cell contents, though I did some very limited
> format handling (looking for non-text fields and such).

I don't understand "looking for non-text fields" as "format handling".
To my way of thinking, knowing the data-type that Excel has assigned
to a cell is close to essential for effective use of the contents. And
you don't have to look very far: the ss:Type attribute tells you
whether a cell's content is String, Number, DateTime, Boolean, or
Error.
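
For example, with a pull-style parse over an Excel 2003 "XML
Spreadsheet" file, the type comes along for free (untested sketch
using xml.etree; it handles one Row at a time, ignores ss:Index
column-skipping, and the file name is made up):

import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"

def iter_rows(path):
    # Yield each <Row> as a list of (ss:Type, text) pairs, one row at a time.
    for event, elem in ET.iterparse(path):
        if elem.tag == "{%s}Row" % SS:
            cells = []
            for data in elem.iter("{%s}Data" % SS):
                cells.append((data.get("{%s}Type" % SS), data.text))
            yield cells
            elem.clear()    # drop the row's subtree so memory use stays flat

for row in iter_rows("hotels.xml"):    # made-up file name
    print(row)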

> > > xhtml. I know there are various python packages around that do it, but
> > > I found the learning curve of those packages to be steeper than just
> > > grokking the spreadsheet structure itself.
>
> > I'm curious to know which are the "various python packages" with the
> > so steep learning curves, and what the steep bits were.
>
> I looked, briefly, at xlrd. I found and scanned a few alternatives,
> though I don't remember what the others were. I needed something I
> could incorporate into my own application framework, and I knew I
> didn't need most of the formatting information. I'm not in any way
> criticizing xlrd, it's simply that, based on its API summary, it seems
> focused on problems I didn't have to solve.

Not so focused at all. The default behaviour of xlrd is to ignore
formatting info as much as possible. AFAICT this is the mode used by
most users.
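
For the record, the no-formatting usage is about as plain as it gets
(sketch; the file name is made up):

import xlrd

book = xlrd.open_workbook("hotels.xls")    # formatting_info defaults to False
sheet = book.sheet_by_index(0)
for rowx in range(sheet.nrows):
    for colx in range(sheet.ncols):
        cell = sheet.cell(rowx, colx)
        # cell.ctype is one of the XL_CELL_* constants; cell.value is the data
        print(rowx, colx, cell.ctype, cell.value)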

> I knew that I needed only
> a small subset of the xlrd behavior, and I concluded (perhaps
> incorrectly) that it would be easier to roll my own parser than find,
> extract, and then port (to my own framework) the corresponding parts
> of xlrd.

Possibly incorrectly. If approached at the time, I would have said:
(a) if desperate to DIY:
(a1) ignore any code for old Excel versions (self.biff_version < 80)
(a2) ignore any code for extracting formatting info
(self.formatting_info)
(a3) find the Sheet.put_cell* methods in sheet.py e.g.

def put_cell(self, rowx, colx, ctype, value, xf_index):

ignore the xf_index arg and subvert them to your own needs instead of
filling up a big rectangular arena with data (a rough sketch follows
after this list)

(b) if not really so desperate, talk to me about implementing an
option in xlrd where callers can specify a callback to be used instead
of the Sheet.put_cell* methods

(c) What is all this "port to my own framework" caper anyway? If you
need to extract data from a database, do you rummage in their code
libraries and port the relevant bits to your own framework?
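
To illustrate (a3): a rough, untested sketch of what "subverting"
put_cell might look like -- whether it works as-is depends on the xlrd
version and its internals, and the file name is made up:

import xlrd
import xlrd.sheet

def my_put_cell(self, rowx, colx, ctype, value, xf_index):
    # Handle the cell immediately; ignore xf_index, store nothing in the Sheet.
    print(rowx, colx, ctype, value)

xlrd.sheet.Sheet.put_cell = my_put_cell    # replace the stock method
book = xlrd.open_workbook("hotels.xls")    # made-up file name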

>
> I needed to extract the content of each row, cell by cell, and build
> data objects (in my framework) with the content of various cells. I
> also needed to build an "exception file" containing malformed entries
> that I could re-open with Excel after my code finished, so that the
> bogus entries could be manually corrected. What I mean by "malformed"
> entry is, for example, an address field that fails to correctly
> geocode or comment fields with confused utf8/unicode contents.

Building objects in your own framework and checking data integrity is
something that happens *after* you've got the basics for a cell (row
id, column id, data type, data value). You have dodgy postal
addresses? You're not alone, and it's not relevant to how you parse
the spreadsheet or even whether the data source was a spreadsheet or a
database query or a box of punched cards.

> My
> focus was on data content, as opposed to presentation. I needed to
> crack the cells into things like "string", "boolean", "float", and so
> on.

"Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
XML tells you the type (String, Boolean, ...). xlrd tells you the type
(XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).
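
The xlrd constants correspond one-for-one; the XL_CELL_* names are real
module-level constants, and the dict below is just an illustrative
mapping:

import xlrd

CTYPE_NAMES = {
    xlrd.XL_CELL_EMPTY:   "empty",
    xlrd.XL_CELL_TEXT:    "text",
    xlrd.XL_CELL_NUMBER:  "number",
    xlrd.XL_CELL_DATE:    "date",
    xlrd.XL_CELL_BOOLEAN: "boolean",
    xlrd.XL_CELL_ERROR:   "error",
    xlrd.XL_CELL_BLANK:   "blank",
}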

>
> Most importantly, I needed to do this one entry at a time -- I did
> *not* want to load the entire spreadsheet at once.
>
> I'm not saying that this couldn't be done with xlrd; only that I chose
> to roll my own and had minimal difficulty doing so.
>
> I hope this helps!

I think so. Thanks. You didn't directly address the "steep learning
curve" question, but you explained enough of where you came from.

It's a matter of "horses for courses". I guess some people might
regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
slightly non-horizontal learning curve :-)

Cheers,
John


