Parsing Excel spreadsheets

Fri Jan 2 11:32:39 EST 2009

brooklineTom wrote:
> On Dec 31 2008, 9:56 am, John Machin <sjmac... at lexicon.net> wrote:
>> On Dec 31 2008, 4:02 pm, brooklineTom <Brookline... at gmail.com> wrote:
>>
>>> andyh... at gmail.com wrote:
>>>> Hi,
>>>> Can anybody recommend an approach for loading and parsing Excel
>>>> spreadsheets in Python. Any well known/recommended libraries for this?
>>>> The only thing I found in a brief search washttp://www.lexicon.net/sjmachin/xlrd.htm,
>>>> but I'd rather get some more input before going with something I don't
>>>> know.
>>>> Thanks,
>>>> Andy.
>>> I save the spreadsheets (in Excel) in xml format.
>> Which means that you need to be on a Windows box with a licensed copy
>> of Excel. I presume you talking about using Excel 2003 and saving as
>> "XML Spreadsheet (*.xml)". Do you save the files manually, or using a
>> COM script? What is the largest xls file that you've saved as xml, how
>> big was the xml file, and how long did it take to parse the xml file?
>> Do you extract formatting information or just cell contents?
> 
> 1. The OP requested Excel files, by construction those must be
> generated with a licensed copy of Excel. I did the actual processing
> on both linux and windoze platforms.

Well, even if Andy meant "Excel files" rather than "Excel-formatted
files" there are many ways to come by these without having a licensed
copy of Excel. FTP and email attachment come to mind most readily.

How then to convert those to XML without Excel?

[...]> I looked, briefly, at xlrd. I found and scanned a few alternatives,
> though I don't remember what the others were. I needed something I
> could incorporate into my own application framework, and I knew I
> didn't need most of the formatting information. I'm not in any way
> criticizing xlrd, it's simply that, based on its API summary, it seems
> focused on problems I didn't have to solve. I knew that I needed only
> a small subset of the xlrd behavior, and I concluded (perhaps
> incorrectly) that it would be easier to roll my own parser than find,
> extract, and then port (to my own framework) the corresponding parts
> of xlrd.
> 
My own case was similar, in that I only needed the value data. The
approach I took was to install xlrd and use it. Job done.

> I needed to extract the content of each row, cell by cell, and build
> data objects (in my framework) with the content of various cells. I
> also needed to build an "exception file" containing malformed entries
> that I could re-open with Excel after my code finished, so that the
> bogus entries could be manually corrected. What I mean by "malformed"
> entry is, for example, an address field that fails to correctly
> geocode or comment fields with confused utf8/unicode contents. My
> focus was on data content, as opposed to presentation. I needed to
> crack the cells into things like "string", "boolean", "float", and so
> on.
> 
> Most importantly, I needed to do this one entry at a time -- I did
> *not* want to load the entire spreadsheet at once.
> 
My data files weren't that large (IIRC the largest spreadsheet was about
6MB), so I was quite happy to load the whole thing in memory, iterate
over it and then write the results to the database as they were extracted.

> I'm not saying that this couldn't be done with xlrd; only that I chose
> to roll my own and had minimal difficulty doing so.
> 
Given the constraints of your problem it seems like an intelligent approach.

> I hope this helps!

I'm sure it will.

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/