parsing MS word docs -- tutorial request

Kay Schluehr kay.schluehr at gmx.net
Wed Oct 29 13:52:12 EDT 2008


On 28 Oct., 15:25, bp.tralfamad... at gmail.com wrote:
> All,
>
> I am trying to write a script that will parse and extract data from a
> MS Word document.  Can / would anyone refer me to a tutorial on how to
> do that?  (perhaps from tables).  I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for word, so help for that would also be
> welcome.
>
> Any help would be appreciated.  Thanks for your attention and
> patience.
>
> ::bp::

One can save MS-Word documents as MHTML ("Single File Web Page"), a
MIME archive that wraps HTML carrying Word-specific, XML-style markup.
If I remember correctly those documents had an .mht extension. The
result is a huge amount of (nevertheless structured) markup gibberish
together with text. If one spends time and attention one can find
patterns in the markup (it is plain text and human-readable).
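
As a minimal sketch of getting at that markup: an .mht file is a MIME
multipart message, so Python's standard email package can pull out the
HTML part. The function name and the tiny sample archive below are made
up for illustration; a real file saved by Word is far larger and also
contains images and other parts.

```python
import email
from email import policy

# Hypothetical, hand-written stand-in for a Word-generated .mht file.
SAMPLE_MHT = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="----=_NextPart_01"\r\n'
    b"\r\n"
    b"------=_NextPart_01\r\n"
    b'Content-Type: text/html; charset="us-ascii"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><body><p class=3DMsoNormal>Hello</p></body></html>\r\n"
    b"------=_NextPart_01--\r\n"
)


def extract_html_from_mht(mht_bytes):
    """Return the first text/html part of an MHTML archive, or None."""
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (here
            # quoted-printable) and decodes the charset to str.
            return part.get_content()
    return None


html = extract_html_from_mht(SAMPLE_MHT)
print(html)
```

For a file on disk, read it in binary mode and pass the bytes to the
same function.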

A few years ago I used this conversion to implement roughly the
following algorithm:

1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour-marked section and determined its
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
equally structured.
4. I wrapped the text that was surrounded by the structure in an
<a href> link and removed the colour marker.
5. In another document I searched for the same text and set an anchor.
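
A much simplified sketch of steps 4 and 5 follows. It is not the
original code: it skips the state machine of steps 2 and 3 and only
rewrites the highlighted spans themselves. It assumes that Word's HTML
export marks highlighted text with a span whose style attribute contains
"mso-highlight:" (check your own export), and all function names and the
target file name are hypothetical.

```python
import re

# Assumption: highlighted text looks roughly like
#   <span style='background:yellow;mso-highlight:yellow'>text</span>
# Nested spans inside the highlight are not handled by this pattern.
HIGHLIGHT = re.compile(
    r"<span\s+style=(['\"])[^'\"]*mso-highlight:[^'\"]*\1>(.*?)</span>",
    re.IGNORECASE | re.DOTALL,
)


def slugify(text):
    """Turn link text into a fragment identifier, e.g. 'section-4-2'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def link_highlighted(html, target_url):
    """Step 4: replace each highlighted span with a link into target_url."""
    def repl(match):
        text = match.group(2)
        return '<a href="%s#%s">%s</a>' % (target_url, slugify(text), text)
    return HIGHLIGHT.sub(repl, html)


def add_anchors(html, phrases):
    """Step 5: wrap the first occurrence of each phrase in a named anchor.

    Plain string replacement: it does not check whether the phrase sits
    inside a tag or attribute.
    """
    for phrase in phrases:
        anchor = '<a name="%s">%s</a>' % (slugify(phrase), phrase)
        html = html.replace(phrase, anchor, 1)
    return html


source = ("<p>See <span style='background:yellow;mso-highlight:yellow'>"
          "Section 4.2</span> for details.</p>")
linked = link_highlighted(source, "spec_b.html")
anchored = add_anchors("<h2>Section 4.2</h2>", ["Section 4.2"])
print(linked)
print(anchored)
```

A real implementation would use the highlighted example to learn the
surrounding markup pattern and then match all equally structured
sections, as described in steps 2 and 3.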

This way I could link two documents (public specifications that were
originally unconnected).

Kay



