Parsing MS Word Document?

John J. Lee jjl at pobox.com
Wed Oct 22 08:51:44 EDT 2003


"MrBill" <nospam at nospam.com> writes:

> I would like to be able to open, read, and extract data from a report that
> is produced in MS Word.  The doc seems to contain embedded spreadsheets.  I
> would like to extract some of the data from the spreadsheets and feed it
> into another application.  I've been reading a little bit about OLE and MS
> Word and sure would like to find a module that hides some of this so-called
> innovation from me.

:-)  Yeah, isn't all that baroque complexity wonderful?

1. Alex Martelli's suggestion on this list: use RTF.  Word can both
   import and export it.  You can automate that from VB or Python in
   the usual COM ways (see item 3).  I don't know whether you'll get
   useful RTF out of embedded Excel sheets, though.

2. Use OpenOffice via PyUNO.

3. As you already know, use the MS Office object models, with the
   Python for Windows extensions (or ctypes, if you're brave).
   Perhaps ADO is what you're looking for?  IIRC, ADO isn't too
   complicated and can treat Excel sheets as data sources, just as it
   does relational databases.
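
To make the ADO idea in 3. a bit more concrete, here is a rough sketch.
It assumes the spreadsheet has already been saved out of the Word
document as a standalone .xls file (ADO can't see inside the .doc),
that the Jet OLE DB provider is installed, and that the sheet is called
"Sheet1".  The function names excel_connection_string and read_sheet
are made up for this example, and the COM part only runs on Windows
with the Python for Windows extensions installed.

```python
def excel_connection_string(xls_path, has_header=True):
    """Build a Jet OLE DB connection string for an .xls workbook."""
    hdr = "Yes" if has_header else "No"
    return (
        "Provider=Microsoft.Jet.OLEDB.4.0;"
        "Data Source=%s;"
        'Extended Properties="Excel 8.0;HDR=%s"' % (xls_path, hdr)
    )


def read_sheet(xls_path, sheet_name="Sheet1"):
    """Return every row of one worksheet as a list of tuples (Windows only)."""
    # Imported here so the module still loads where pywin32 is absent.
    import win32com.client

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open(excel_connection_string(xls_path))
    try:
        rs = win32com.client.Dispatch("ADODB.Recordset")
        # ADO addresses a worksheet as a table named [SheetName$].
        rs.Open("SELECT * FROM [%s$]" % sheet_name, conn)
        rows = []
        while not rs.EOF:
            rows.append(
                tuple(rs.Fields.Item(i).Value for i in range(rs.Fields.Count))
            )
            rs.MoveNext()
        rs.Close()
        return rows
    finally:
        conn.Close()
```

Something like read_sheet(r"C:\reports\figures.xls") would then hand
back the rows for feeding into the other application.  The path is
only a placeholder.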

For simpler Word docs (no embedded stuff), there are other tools out
there, but they'd be no use in this case.

A useful tip for option 3 is to record a VB macro in Word, then edit
it into something sane.  You can keep it in VB, or make the relatively
trivial edits required to convert it to Python.  Here's an example on
automating RTF generation:

http://www.google.com/groups?q=author:jjl%40pobox.com+RTF+Word&hl=en&lr=&ie=UTF-8&selm=87isqnnxvy.fsf%40pobox.com&rnum=1
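
In case that link is no use to you, the general shape of such a
converted macro is roughly as below.  This is a sketch and not the
code from the linked post: it assumes Word and the Python for Windows
extensions are installed, and the helper names rtf_path_for and
save_as_rtf are invented here.  The number 6 is the value of Word's
wdFormatRTF constant.

```python
import os

WD_FORMAT_RTF = 6  # value of Word's wdFormatRTF constant


def rtf_path_for(doc_path):
    """Return the absolute path of the .rtf file to write next to doc_path."""
    base, _ext = os.path.splitext(os.path.abspath(doc_path))
    return base + ".rtf"


def save_as_rtf(doc_path):
    """Open doc_path in Word and save a copy as RTF (Windows only)."""
    # Imported here so the module still loads where pywin32 is absent.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word wants absolute paths; relative ones resolve unpredictably.
        doc = word.Documents.Open(os.path.abspath(doc_path))
        try:
            out = rtf_path_for(doc_path)
            doc.SaveAs(out, FileFormat=WD_FORMAT_RTF)
        finally:
            doc.Close(False)  # False: don't prompt to save changes
    finally:
        word.Quit()
    return out
```

The calls on word and doc map one-to-one onto what the macro recorder
emits in VB, which is why the conversion is mostly mechanical.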


John



