Unicode, COM, Word Problem

matt matt at virtualspectator.com
Sat Jan 20 06:09:47 EST 2001


I haven't tried anything like what you are doing, but I do see something
potentially nasty.  I assume that your clipboard copy is receiving the
characters in Microsoft's pseudo-interpretation of ISO-8859-1 (the Windows-1252
code page).  For a start, ISO-8859-1 is a full 256-character set, so even if
Microsoft's version of ISO-8859-1 were correct, you would still get many
characters outside the standard 128 characters of ASCII.  The thing to watch
with Microsoft is that they use a range normally reserved for control
characters to hold punctuation etc.  These are in the range 0x80 to 0x9F; for
example, they use 0x80 for the euro sign, which is not defined in
ISO-8859-1.
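
To make the difference concrete, here is a small sketch, assuming a modern
Python 3 interpreter (not the Python of this thread) and its standard "cp1252"
and "latin-1" codecs.  The sample bytes are made up for illustration; they are
not taken from Kirby's documents.

```python
# Four bytes in the 0x80-0x9F range, as a Windows application might emit them.
raw = bytes([0x80, 0x93, 0x94, 0x85])

# Windows-1252 treats these bytes as printable punctuation:
# euro sign, left and right curly double quotes, horizontal ellipsis.
as_cp1252 = raw.decode("cp1252")

# Python's latin-1 codec maps the same bytes to C1 control characters.
as_latin1 = raw.decode("latin-1")

for byte, win, iso in zip(raw, as_cp1252, as_latin1):
    print(f"0x{byte:02X}  cp1252=U+{ord(win):04X}  latin-1=U+{ord(iso):04X}")
```

The same byte values come out as different code points depending on which
codec is used, which is why text pasted from Word can look fine on Windows
and broken elsewhere.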

My guess is that you need to 1) not parse your text as ASCII, 2) parse it
instead as ISO-8859-1, but 3) take account of Microsoft's use of the undefined
range, using a translation table to get those characters back into valid
ISO-8859-1.
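
The translation-table idea above could be sketched roughly as follows.  This
assumes Python 3 and that the text has already arrived as a Unicode string
(as COM strings generally do).  The table CP1252_TO_LATIN1 and the helper
to_latin1_bytes are hypothetical names, and the substitutions are only one
possible choice, not a complete mapping.

```python
# Hypothetical table: punctuation that Word commonly inserts, mapped to plain
# substitutes that do exist in ISO-8859-1.  Extend it as needed.
CP1252_TO_LATIN1 = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2026: "...",  # horizontal ellipsis
    0x20AC: "EUR",  # euro sign
}


def to_latin1_bytes(text):
    """Return text as ISO-8859-1 bytes, substituting known punctuation first."""
    cleaned = text.translate(CP1252_TO_LATIN1)
    # Anything still outside ISO-8859-1 becomes "?" instead of raising.
    return cleaned.encode("latin-1", errors="replace")


sample = "\u201cna\u00efve\u201d \u2026 \u20ac5"
print(to_latin1_bytes(sample))
```

Characters that are not in the table and not in ISO-8859-1 are replaced with
"?" rather than raising an error, which may or may not be what you want for a
summary.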

But that's all just a guess.  I am a 'nix person who has to deal with people
copying and pasting from "word" into form submissions in a browser ... so I
have been stung by something similar.

Matt



On Sat, 20 Jan 2001, Kirby James wrote:
> Hi, I'm writing a script to read and summarize a large number of MS Word-97
> documents, using the Python win32com.client. I open each file in turn (using
> COM and Word) and select and copy the first 500 characters.
>     myWord.Selection.HomeKey(constants.wdStory)
>     myWord.Selection.MoveEnd(constants.wdCharacter, 500)
>     sText = myWord.Selection.Text
> However when I try to print sText (or write it to a file
> output.write(sText)) I get an exception
>     UnicodeError: ASCII encoding error: ordinal not in range(128)
> I've tried using str
>      sText2 = str(sText)
> but still get the same error. I've tried wrapping the code with a try:
> except UnicodeError: block but then the string is not converted. I'd
> appreciate any pointers as to how I can 'clean-up' this string so that I can
> output it as an 8-bit character string.
> Tks Kirby