How to use win32com to convert a MS WORD doc to HTML ?

Tim Golden mail at timgolden.me.uk
Tue Aug 19 11:58:54 EDT 2008


Lave wrote:
> Hi, all !
> 
> I'm a totally newbie huh:)
> 
> I want to convert MS WORD docs to HTML, I found python windows
> extension win32com can make this. But I can't find the method, and I
> can't find any document helpful.

You have broadly two approaches here, both
involving automating Word (i.e. using the
COM object model it exposes, referred to
in another post in this thread).

1) Use the COM model to have Word load your
doc, and SaveAs it in HTML format. Advantage:
it's relatively straightforward. Disadvantage:
you're at the mercy of whatever HTML Word emits.

2) Use the COM model to iterate over the paragraphs
in your document, emitting your own HTML. Advantage:
you get control. Disadvantage: the more complex your
doc, the more work you have to do. (What do you do with
images, for example? Internal links?)

To do the first, just record a macro in Word to
do what you want and then reproduce the macro
in Python. Something like this:

<code>
import win32com.client

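# GetObject asks Word to open the file and hands back the Document object
doc = win32com.client.GetObject ("c:/data/temp/songs.doc")
# FileFormat=8 is Word's wdFormatHTML constant
doc.SaveAs (FileName="c:/data/temp/songs.html", FileFormat=8)
doc.Close ()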

</code>
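
If you'd rather have the steps spelt out than rely on
GetObject, here is a rough sketch of the same idea wrapped
in a function. The names html_target and save_as_html are
made up for illustration; 8 and 10 are the values of Word's
wdFormatHTML and wdFormatFilteredHTML constants. It assumes
pywin32 and Word are installed, and I haven't tried it
against every version of Word.

```python
import os

# Values of Word's WdSaveFormat constants
WD_FORMAT_HTML = 8            # wdFormatHTML
WD_FORMAT_FILTERED_HTML = 10  # wdFormatFilteredHTML: less Office-specific markup


def html_target(doc_path):
    """Return doc_path with its extension replaced by .html."""
    root, _ext = os.path.splitext(doc_path)
    return root + ".html"


def save_as_html(doc_path, file_format=WD_FORMAT_HTML):
    """Have Word open doc_path and save it alongside as HTML.

    Hypothetical helper, not part of win32com.
    """
    # Imported here so the module still loads where pywin32 is missing
    import win32com.client

    # Word resolves relative paths against its own working directory,
    # so hand it absolute paths.
    source = os.path.abspath(doc_path)
    target = html_target(source)

    # DispatchEx asks for a fresh Word instance, so that Quit below
    # should not close a copy of Word the user already has open.
    word = win32com.client.DispatchEx("Word.Application")
    try:
        doc = word.Documents.Open(source)
        try:
            doc.SaveAs(FileName=target, FileFormat=file_format)
        finally:
            doc.Close(False)  # False: don't save changes to the .doc
    finally:
        word.Quit()
    return target
```

Calling save_as_html ("c:/data/temp/songs.doc") should leave
songs.html next to the original; pass WD_FORMAT_FILTERED_HTML
as the second argument if plain Word HTML is too noisy.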

To do the second, you have to roll your own html
doc. Crudely, this would do it:

<code>
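import codecs
import html
import win32com.client

doc = win32com.client.GetObject ("c:/data/temp/songs.doc")
with codecs.open ("c:/data/temp/s2.html", "w", encoding="utf8") as f:
   f.write ("<html><body>\n")
   for para in doc.Paragraphs:
     # Range.Text ends with the paragraph mark (\r); drop it and
     # escape the characters which are special in HTML
     text = html.escape (para.Range.Text.rstrip ("\r"))
     style = html.escape (para.Style.NameLocal)
     f.write ('<p class="%(style)s">%(text)s</p>\n' % locals ())
   f.write ("</body></html>\n")

doc.Close ()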

</code>
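
If you go down this second route it may help to keep the
HTML-building apart from the Word automation, so you can
try it out without Word running. A minimal sketch, assuming
you feed it (style name, text) pairs; css_class and
paragraphs_to_html are names invented here, and skipping
empty paragraphs is my choice, not something Word requires.

```python
import html
import re


def css_class(style_name):
    """Turn a Word style name such as 'Heading 1' into a single CSS class."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "-", style_name.strip())
    return cleaned.strip("-").lower()


def paragraphs_to_html(paragraphs):
    """Build an HTML page from an iterable of (style_name, text) pairs."""
    lines = ["<html><body>"]
    for style_name, text in paragraphs:
        # Word ends each paragraph's text with a paragraph mark (\r)
        text = text.rstrip("\r")
        if not text:
            continue  # skip empty paragraphs
        lines.append(
            '<p class="%s">%s</p>' % (css_class(style_name), html.escape(text))
        )
    lines.append("</body></html>")
    return "\n".join(lines)


# Made-up sample data standing in for a real document
sample = [
    ("Heading 1", "Songs & Ballads\r"),
    ("Normal", "\r"),
    ("Normal", "a < b\r"),
]
print(paragraphs_to_html(sample))
```

With a real document you would pass it something like
((p.Style.NameLocal, p.Range.Text) for p in doc.Paragraphs)
and write the result out as UTF-8.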

TJG


