Parsing MS Word Document?

Dave Kuhlman dkuhlman at rexx.com
Wed Oct 22 12:23:02 EDT 2003


MrBill wrote:

> I would like to be able to open, read, and extract data from a
> report that is produced in MS Word.  The doc seems to contain
> embedded spreadsheets.  I would like to extract some of the data
> from the spreadsheets and feed it into another application.  I've
> been reading a little bit about OLE and MS Word and sure would
> like to find a module that hides some of this so-called
> innovation from me.

Here is another strategy:

1. Load the document into MS Word.  Save the document as HTML.

2. Run the `links` Web browser on the file with the -dump option.
   This will convert the HTML into plain text.  Example:

       links -dump mydoc.html > mydoc.txt

3. Use Python to extract information from the resulting plain text
   file.
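Steps 2 and 3 might be sketched in Python roughly as follows.  This
is only a minimal sketch, not a tested recipe for any particular
report: the function names `dump_html_to_text` and `extract_rows` are
made up for illustration, and the parsing assumes that table cells in
the dump are separated either by `|` border characters or by runs of
two or more spaces.  The actual layout depends on the browser and on
the document, so look at mydoc.txt before relying on it.  It needs
Python 3.7 or later for `capture_output`.

```python
import re
import subprocess


def dump_html_to_text(html_path, browser="links"):
    """Run `links -dump` (or `lynx -dump`) and return the plain text."""
    result = subprocess.run(
        [browser, "-dump", html_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_rows(text, min_columns=2):
    """Split each line into cells; keep lines that look like table rows.

    Assumption: cells are separated by `|` border characters or by
    runs of two or more spaces.
    """
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in re.split(r"\||\s{2,}", line.strip())]
        cells = [c for c in cells if c]
        if len(cells) < min_columns:
            continue  # ordinary prose or a blank line
        if all(set(c) <= set("+-=") for c in cells):
            continue  # a border line such as +-----+-----+
        rows.append(cells)
    return rows
```

With `links` installed, something like
`extract_rows(dump_html_to_text("mydoc.html"))` would give a list of
rows, each a list of cell strings, which can then be filtered for the
values the other application needs.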

Another suggestion: the `links` Web browser formats tables
differently from `lynx`, and perhaps better.  But you might try
`lynx`, too; it also accepts the -dump option.

Dave

-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman
dkuhlman at rexx.com



