Parsing Excel spreadsheets

Fri Jan 2 10:01:33 EST 2009

On Dec 31 2008, 9:56 am, John Machin <sjmac... at lexicon.net> wrote:
> On Dec 31 2008, 4:02 pm, brooklineTom <Brookline... at gmail.com> wrote:
>
> > andyh... at gmail.com wrote:
> > > Hi,
>
> > > Can anybody recommend an approach for loading and parsing Excel
> > > spreadsheets in Python. Any well known/recommended libraries for this?
>
> > > The only thing I found in a brief search washttp://www.lexicon.net/sjmachin/xlrd.htm,
> > > but I'd rather get some more input before going with something I don't
> > > know.
>
> > > Thanks,
> > > Andy.
>
> > I save the spreadsheets (in Excel) in xml format.
>
> Which means that you need to be on a Windows box with a licensed copy
> of Excel. I presume you talking about using Excel 2003 and saving as
> "XML Spreadsheet (*.xml)". Do you save the files manually, or using a
> COM script? What is the largest xls file that you've saved as xml, how
> big was the xml file, and how long did it take to parse the xml file?
> Do you extract formatting information or just cell contents?

1. The OP requested Excel files, by construction those must be
generated with a licensed copy of Excel. I did the actual processing
on both linux and windoze platforms.
2. Yes, I used Excel 2003. I haven't looked at later versions.
3. The largest file I used was about 228M, containing 36,393 hotel
properties from Commission Junction. Each entry had 113 cells. The
parsing overhead was minimal (on a per-entry basis) -- that's why I
choose to use a pull-parser.
4. I extracted primarily cell contents, though I did some very limited
format handling (looking for non-text fields and such).

> > I started with the
> > standard xml tools (xml.dom and xml.dom.minidom). I built a
> > pullparser, and then just crack them. The MS format is tedious and
> > overly complex (like all MS stuff), but straightforward.
>
> What do you think of the xml spat out by Excel 2007's (default) xlsx
> format?

I haven't looked at this.

> >  Once I've
> > cracked them into their component parts (headers, rows, cells, etc),
> > then I walk through them doing whatever I want.
>
> > I found this material to be no worse than doing similar crud with
> > xhtml. I know there are various python packages around that do it, but
> > I found the learning curve of those packages to be steeper than just
> > grokking the spreadsheet structure itself.
>
> I'm curious to know which are the "various python packages" with the
> so steep learning curves, and what the steep bits were.

I looked, briefly, at xlrd. I found and scanned a few alternatives,
though I don't remember what the others were. I needed something I
could incorporate into my own application framework, and I knew I
didn't need most of the formatting information. I'm not in any way
criticizing xlrd, it's simply that, based on its API summary, it seems
focused on problems I didn't have to solve. I knew that I needed only
a small subset of the xlrd behavior, and I concluded (perhaps
incorrectly) that it would be easier to roll my own parser than find,
extract, and then port (to my own framework) the corresponding parts
of xlrd.

I needed to extract the content of each row, cell by cell, and build
data objects (in my framework) with the content of various cells. I
also needed to build an "exception file" containing malformed entries
that I could re-open with Excel after my code finished, so that the
bogus entries could be manually corrected. What I mean by "malformed"
entry is, for example, an address field that fails to correctly
geocode or comment fields with confused utf8/unicode contents. My
focus was on data content, as opposed to presentation. I needed to
crack the cells into things like "string", "boolean", "float", and so
on.

Most importantly, I needed to do this one entry at a time -- I did
*not* want to load the entire spreadsheet at once.

I'm not saying that this couldn't be done with xlrd; only that I chose
to roll my own and had minimal difficulty doing so.

I hope this helps!