Problem Converting Word to UTF8 Text File

patrick.waldo at gmail.com
Sun Oct 21 12:35:43 EDT 2007


Hi all,

I'm trying to convert a bunch of Microsoft Word documents that contain
Unicode characters into UTF-8 text files.  Everything works fine at
the beginning: the Word documents get converted, and new UTF-8 text
files with the same names get created.  But when I then try to copy the
data into them, I keep getting "TypeError: coercing to Unicode: need
string or buffer, instance found".  I'm probably copying the Word
document the wrong way.  What can I do?

Thanks,
Patrick
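
The error message is consistent with shutil.copyfile receiving an open
file object where it expects a path string: in the script below,
txt_doc is rebound to the object returned by codecs.open before it is
passed to copyfile.  The following minimal sketch reproduces that under
stated assumptions: the file names and placeholder bytes are made up
for illustration, and a plain temporary file stands in for a real Word
document.  The exact wording of the TypeError differs between Python
versions.

```python
import codecs
import os
import shutil
import tempfile

# Hypothetical stand-in for a .doc file; it is NOT a real Word document.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'sample.doc')
with open(src, 'wb') as f:
    f.write(b'placeholder bytes')

dst_path = os.path.join(workdir, 'sample.txt')
dst_file = codecs.open(dst_path, 'w', 'utf-8')  # a file object, not a path

try:
    shutil.copyfile(src, dst_file)  # copyfile expects two path strings
    raised = False
except TypeError:
    raised = True
finally:
    dst_file.close()

# Passing the path string instead of the file object is accepted.
shutil.copyfile(src, dst_path)
with open(dst_path, 'rb') as f:
    copied_bytes = f.read()
```

Passing the path avoids the TypeError, but copyfile only duplicates
bytes, so a copied .doc would still be binary Word data rather than
text.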


import os, codecs, glob, shutil, win32com.client
from win32com.client import Dispatch

input = 'C:\\text_samples\\source\\*.doc'
output_dir = 'C:\\text_samples\\source\\output'
FileFormat=win32com.client.constants.wdFormatText

# First pass: copy each .doc to the output directory, then open it in
# Word and save it in plain-text format.
for doc in glob.glob(input):
    doc_copy = shutil.copy(doc,output_dir)
    WordApp = Dispatch("Word.Application")
    WordApp.Visible = 1
    WordApp.Documents.Open(doc)
    WordApp.ActiveDocument.SaveAs(doc, FileFormat)
WordApp.ActiveDocument.Close()
WordApp.Quit()


# Second pass: create a UTF-8 .txt file named after each .doc and try
# to copy the data into it.
for doc in glob.glob(input):
    txt_split = os.path.splitext(doc)
    txt_doc = txt_split[0] + '.txt'
    txt_doc = codecs.open(txt_doc,'w','utf-8')
    shutil.copyfile(doc,txt_doc)  # the copy step where the TypeError appears
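
One possible way around the problem, sketched here rather than tested
against Word: skip the file copy entirely, ask Word for the document
text through COM, and write that text out as UTF-8 from Python.  The
helper names txt_path_for, write_utf8 and convert_all are hypothetical.
The sketch assumes pywin32 is installed, that Document.Content.Text
returns the body with paragraph marks as '\r', and that the output
directory already exists.

```python
import glob
import io
import os


def txt_path_for(doc_path, output_dir):
    """Return output_dir/<name>.txt for a given <name>.doc path."""
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(output_dir, base + '.txt')


def write_utf8(text, txt_path):
    """Write a unicode string to txt_path encoded as UTF-8."""
    with io.open(txt_path, 'w', encoding='utf-8') as out:
        # Assumption: Word marks paragraph ends with '\r'.
        out.write(text.replace('\r', '\n'))


def convert_all(pattern, output_dir):
    """Export the body text of every matching .doc as a UTF-8 .txt file."""
    # Imported here because pywin32 is only available on Windows.
    from win32com.client import Dispatch

    word = Dispatch("Word.Application")  # start Word once, not per file
    try:
        for doc_path in glob.glob(pattern):
            doc = word.Documents.Open(os.path.abspath(doc_path))
            try:
                write_utf8(doc.Content.Text,
                           txt_path_for(doc_path, output_dir))
            finally:
                doc.Close(SaveChanges=0)  # 0 = wdDoNotSaveChanges
    finally:
        word.Quit()
```

It could be called as convert_all('C:\\text_samples\\source\\*.doc',
'C:\\text_samples\\source\\output').  Only the main body of each
document is read this way; text in headers and footers is stored
separately and would not be exported by this sketch.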



