Problem Converting Word to UTF8 Text File

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Sun Oct 21 23:39:20 EDT 2007


On Sun, 21 Oct 2007 15:32:57 -0300, <patrick.waldo at gmail.com> wrote:

> However, I still cannot read the unicode from the Word file.  If I take
> out the first for-statement, I get a bunch of garbled text, which
> isn't helpful.  I would save them all manually, but I want to figure
> out how to do it in Python, since I'm just beginning.
>
> My intuition says the problem is with
>
> FileFormat=win32com.client.constants.wdFormatText
>
> because it converts fine to a text file, just not a utf-8 text file.
> How can I modify this, or is there another way to code this type of
> file conversion from *.doc to *.txt with unicode characters?

Ah! I thought you were already getting the right file format.
I can't test it right now, but this KB document
http://support.microsoft.com/kb/209186/en-us
suggests you should use wdFormatUnicodeText when saving the document.
What the MS docs call "Unicode" when dealing with files is, in general,
UTF-16.
In that case, to convert the saved file to UTF-8, the sequence would be:

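    # Read the raw bytes; the "utf16" codec uses the BOM to pick the byte order.
    with open(original_filename, "rb") as f:
        udata = f.read().decode("utf16")
    # Re-encode the text as UTF-8 and write it out as bytes.
    with open(new_filename, "wb") as f:
        f.write(udata.encode("utf8"))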
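
For reuse, the same sequence can be wrapped in a small function. This is only a sketch: it assumes the file Word wrote really is UTF-16 with a byte order mark, which I have not verified here. The names utf16_to_utf8 and round_trip_ok are made up for illustration. The second function just builds a throwaway UTF-16 file so the conversion can be checked without Word.

```python
import os
import tempfile


def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file (with BOM) as UTF-8.

    Hypothetical helper; assumes src_path holds UTF-16 data with a BOM.
    """
    # Binary mode keeps line endings exactly as they were saved.
    with open(src_path, "rb") as src:
        text = src.read().decode("utf-16")
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))
    return text


def round_trip_ok():
    """Write a small UTF-16 sample, convert it, and compare the result."""
    tmp_dir = tempfile.mkdtemp()
    src = os.path.join(tmp_dir, "sample_utf16.txt")
    dst = os.path.join(tmp_dir, "sample_utf8.txt")
    sample = u"caf\u00e9 \u4e2d\u6587\r\n"
    with open(src, "wb") as f:
        # The "utf-16" encoder prepends a BOM, as assumed above.
        f.write(sample.encode("utf-16"))
    utf16_to_utf8(src, dst)
    with open(dst, "rb") as f:
        return f.read().decode("utf-8") == sample
```

Calling utf16_to_utf8(original_filename, new_filename) then replaces the inline sequence. If the decode step raises UnicodeError, the file was probably not saved as UTF-16 in the first place.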

-- 
Gabriel Genellina



