Parsing Excel spreadsheets

brooklineTom BrooklineTom at gmail.com
Wed Jan 7 16:34:23 EST 2009


On Jan 2, 7:04 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 3, 2:01 am, brooklineTom <Brookline... at gmail.com> wrote:
>
> <snip>
>
> My point was that however the original XLS files were created or
> acquired, the first step in your solution involves converting the XLS
> file to "XML Spreadsheet" format, which requires a copy of Excel on a
> Windows box. Many people start with an XLS file, no Excel and no
> Windows box, no COM, and users who can't be relied on to open a file
> and save it in the right format with the right name and extension.

True enough. I develop on a WinXP box and have Excel. I just used it.

> BTW, did you consider opening the XLS files with OpenOffice.org's Calc
> and saving it in their default ods format (chunks of XML in a zip
> file)?

No. As I said, I have Excel.

> > 3. The largest file I used was about 228M,
>
> Is that the XLS file or the XML file?

That's the xml file size. I just *love* file-bloat, don't you? :-)

>
> <snip>
>

> > I knew that I needed only
> > a small subset of the xlrd behavior, and I concluded (perhaps
> > incorrectly) that it would be easier to roll my own parser than find,
> > extract, and then port (to my own framework) the corresponding parts
> > of xlrd.
>
> Possibly incorrectly. If approached at the time, I would have said:
> (a) if desperate to DIY:
> (a1) ignore any code for old Excel versions (self.biff_version < 80)
> (a2) ignore any code for extracting formatting info
> (self.formatting_info)
> (a3) find the Sheet.put_cell* methods in sheet.py e.g.
>
> def put_cell(self, rowx, colx, ctype, value, xf_index):
>
> ignore the xf_index arg and subvert them to your own needs instead of
> filling up a big rectangular arena with data
>
> (b) if not really so desperate, talk to me about implementing an
> option in xlrd where callers can specify a callback to be used instead
> of the Sheet.put_cell* methods
>
> (c) What is all this "port to my own framework" caper anyway? If you
> need to extract data from a database, do you rummage in their code
> libraries and port the relevant bits to your own framework?

I'm sure xlrd is fine, I wasn't any way suggesting that any
"improvements" are needed. I was just explaining what I did.

>
> <snip>
>

> "Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
> XML tells you the type (String, Boolean, ...). xlrd tells you the type
> (XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).

Sure, once you've sorted through the excess MS-specific stuff, dealt
with the runs of empty cells, and so on.

I needed to write an exception file, containing the original
spreadsheet entries that failed, so that our curators could find and
fix (or delete) them. Thus, I already had to do row-by-row handling of
the input, I had to catch and handle exceptions, I had to know and
remember enough about the original SS format/layout to replicate it in
the exception file, and so on.

> I think so. Thanks. You didn't directly address the "steep learning
> curve" question, but you explained enough of where you came from.
>
> It's a matter of "horses for courses". I guess some people might
> regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
> slightly non-horizontal learning curve :-)

*LOL*

Yes. Sadly, I had already climbed the dom/minidom/pullparser curve for
all the other stuff I have to do. In the space I work in, xml handling
is pervasive enough that I already have all that stuff "in my hands"
anyway.

I wonder if the OP ever read any of this.... :-)

Thx,
Tom



More information about the Python-list mailing list