Parsing MS Word Document?
Dave Kuhlman
dkuhlman at rexx.com
Wed Oct 22 12:23:02 EDT 2003
MrBill wrote:
> I would like to be able to open, read, and extract data from a
> report that is produced in MS Word. The doc seems to contain
> embedded spreadsheets. I would like to extract some of the data
> from the spreadsheets and feed it into another application. I've
> been reading a little bit about OLE and MS Word and sure would
> like to find a module that hides some of this so-called
> innovation from me.
Here is another strategy:
1. Load the document into MS Word. Save the document as HTML.
2. Run the `links` Web browser on the file with the -dump option.
This will convert the HTML into plain text. Example:

    links -dump mydoc.html > mydoc.txt

3. Use Python to extract information from the resulting plain text
file.
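
For step 3, here is a minimal sketch of pulling table-like rows out of the dumped text. The function name `extract_rows` and the rules it uses are my own assumptions, not anything `links` guarantees: it assumes cells are separated by "|" characters or by runs of two or more spaces, and that lines made up only of "-", "+", "=" and "|" are borders. Check a real dump of your document and adjust the pattern to match.

```python
import re

def extract_rows(text, min_columns=2):
    """Return table-like rows from a plain-text dump as lists of cell strings.

    Hypothetical helper: the cell separators below are an assumption
    about how the text browser lays out tables.
    """
    rows = []
    for line in text.splitlines():
        # Drop any border characters drawn at the edges of the row.
        cleaned = line.strip().strip("|").strip()
        # Skip blank lines and horizontal rules such as +-----+-----+.
        if not cleaned or set(cleaned) <= set("-+=| "):
            continue
        # Split on "|" or on two or more consecutive spaces.
        cells = [c.strip() for c in re.split(r"\s*\|\s*|\s{2,}", cleaned)]
        cells = [c for c in cells if c]
        # Ordinary prose lines yield a single cell and are ignored.
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows
```

To use it, read the file produced in step 2 (for example `open("mydoc.txt").read()`) and pass the text to `extract_rows`. Each returned row is a list of strings that you can convert and hand to the other application.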
Another suggestion: the `links` Web browser formats tables
differently from, and perhaps better than, `lynx`. Still, you
might try lynx too and compare the results.
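
If you want to compare the two browsers from Python rather than by hand, something like the following sketch might do. The helper names `dump_command` and `dump_html` are made up for illustration. It assumes `links` or `lynx` is on your PATH and relies only on the -dump option, which both browsers accept.

```python
import shutil
import subprocess

def dump_command(browser, html_path):
    """Build the command line for a plain-text dump of html_path."""
    if browser not in ("links", "lynx"):
        raise ValueError("unsupported browser: %r" % browser)
    return [browser, "-dump", html_path]

def dump_html(browser, html_path):
    """Return the plain-text rendering, or None if the browser is missing."""
    # shutil.which returns None when the program is not on the PATH.
    if shutil.which(browser) is None:
        return None
    result = subprocess.run(
        dump_command(browser, html_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the browser fails
    )
    return result.stdout
```

Calling `dump_html("links", "mydoc.html")` and `dump_html("lynx", "mydoc.html")` and looking at the two results side by side should show which one lays out your tables in the way that is easier to parse.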
Dave
--
Dave Kuhlman
http://www.rexx.com/~dkuhlman
dkuhlman at rexx.com