Parsing Excel spreadsheets

Wed Dec 31 09:56:33 EST 2008

On Dec 31 2008, 4:02 pm, brooklineTom <Brookline... at gmail.com> wrote:
> andyh... at gmail.com wrote:
> > Hi,
>
> > Can anybody recommend an approach for loading and parsing Excel
> > spreadsheets in Python. Any well known/recommended libraries for this?
>
> > The only thing I found in a brief search was http://www.lexicon.net/sjmachin/xlrd.htm,
> > but I'd rather get some more input before going with something I don't
> > know.
>
> > Thanks,
> > Andy.
>
> I save the spreadsheets (in Excel) in xml format.

Which means that you need to be on a Windows box with a licensed copy
of Excel. I presume you are talking about using Excel 2003 and saving as
"XML Spreadsheet (*.xml)". Do you save the files manually, or using a
COM script? What is the largest xls file that you've saved as xml, how
big was the xml file, and how long did it take to parse the xml file?
Do you extract formatting information or just cell contents?

> I started with the
> standard xml tools (xml.dom and xml.dom.minidom). I built a
> pullparser, and then just crack them. The MS format is tedious and
> overly complex (like all MS stuff), but straightforward.

What do you think of the xml spat out by Excel 2007's (default) xlsx
format?

>  Once I've
> cracked them into their component parts (headers, rows, cells, etc),
> then I walk through them doing whatever I want.
>
> I found this material to be no worse than doing similar crud with
> xhtml. I know there are various python packages around that do it, but
> I found the learning curve of those packages to be steeper than just
> grokking the spreadsheet structure itself.

I'm curious to know which "various python packages" have such steep
learning curves, and what the steep bits were.
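
For anyone following along, here is a minimal sketch of the approach quoted
above: cracking an Excel 2003 "XML Spreadsheet" file into sheets, rows and
cells with the standard library. It uses xml.dom.minidom rather than a pull
parser, purely to keep it short. The SAMPLE document and the names
read_sheets and cell_value are made up for illustration, and the sketch
assumes a dense sheet: it extracts cell contents only (no formatting) and
ignores sparse-cell attributes such as ss:Index.

```python
from xml.dom import minidom

# Namespace used by the Excel 2003 "XML Spreadsheet" format.
SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"

# Hypothetical, hand-written sample standing in for a file saved from Excel.
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">item</Data></Cell>
        <Cell><Data ss:Type="String">qty</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">widget</Data></Cell>
        <Cell><Data ss:Type="Number">3</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
"""


def cell_value(cell):
    """Return the content of one Cell element, or None if it has no Data."""
    data = cell.getElementsByTagNameNS(SS_NS, "Data")
    if not data:
        return None
    node = data[0]
    text = "".join(n.data for n in node.childNodes
                   if n.nodeType == n.TEXT_NODE)
    # Cells typed as Number are converted to float; everything else stays text.
    if node.getAttributeNS(SS_NS, "Type") == "Number":
        return float(text)
    return text


def read_sheets(xml_text):
    """Map each worksheet name to a list of rows (each a list of cell values)."""
    doc = minidom.parseString(xml_text)
    sheets = {}
    for ws in doc.getElementsByTagNameNS(SS_NS, "Worksheet"):
        name = ws.getAttributeNS(SS_NS, "Name")
        rows = []
        for row in ws.getElementsByTagNameNS(SS_NS, "Row"):
            cells = row.getElementsByTagNameNS(SS_NS, "Cell")
            rows.append([cell_value(c) for c in cells])
        sheets[name] = rows
    return sheets


if __name__ == "__main__":
    for name, rows in read_sheets(SAMPLE).items():
        print(name, rows)
```

minidom loads the whole document into memory, so for large files a streaming
parser (xml.dom.pulldom or xml.sax) is probably the better fit, which is
where the questions above about file size and parse time come in.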

Cheers,
John