Getting Properly Encoded Strings from Word into Python

Frederick H. Bartlett fbartlet-fixit at optonline.net
Fri Jan 18 12:23:33 EST 2002


I've gotten a lot of help here (thanks, guys!), but I always seem to
need more ...

I'm still trying to figure out how to get unmangled strings out of Word
so that I can work with them in Python. Here's what I (think I) know so
far:

1. Word (and VB/VBA) work natively in Unicode.
2. Windows 95/98 do not.
3. Windows API functions that are passed strings come in two
    versions, one for ANSI and one for Unicode.
4. The Unicode versions don't work under 95/98.
5. Word does not enforce Unicode, ANSI, or anything else.
6. Word passes strings to VBA functions (as .Text) without any
    formatting.
7. Word formatting is accomplished via a pointer table.
8. There is no direct access to that table.

So, if one types "α is the first letter of the Greek alphabet" in
Word and uses either "Insert | Symbol" or a font change to Symbol for
the alpha, the result will not be a Unicode string. Instead, it will be
an ANSI string with character-level formatting.

This is why the question of encoding keeps coming up from win32com
users. Unless you can force your Word users to input "special
characters" (that is, un-American characters) via Unicode, you're up the
proverbial crick.

A fellow on microsoft.public.word.vba.general, Klaus Linke, posted a
very long VB routine there that changes characters in the Symbol font to
Unicode. It loops through a document's characters collection looking for
characters in the Symbol font and replaces them with their Unicode
equivalents using a large (317-line) translation table.
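
For comparison, here is a minimal Python sketch of the same idea, not a
port of Klaus Linke's routine. It assumes the document has already been
reduced to (text, font name) pairs, which one might collect from each
item in a document's Characters collection via its Text and Font.Name
properties. The table is a deliberately tiny, illustrative subset of the
Symbol font layout; the names SYMBOL_TO_UNICODE and symbol_to_unicode
are hypothetical. The handling of codes in the U+F020 to U+F0FF
private-use range is an assumption about how Word may report symbol
characters, so check it against a real document.

```python
# Illustrative subset of the Symbol font layout: Latin letter -> Greek letter.
# A real table would need to cover the font's whole character set.
SYMBOL_TO_UNICODE = {
    "a": "\u03b1",  # alpha
    "b": "\u03b2",  # beta
    "g": "\u03b3",  # gamma
    "d": "\u03b4",  # delta
    "p": "\u03c0",  # pi
}


def symbol_to_unicode(runs):
    """Join (text, font_name) pairs, translating Symbol-font characters."""
    out = []
    for text, font_name in runs:
        if font_name != "Symbol":
            # Text in any other font is passed through untouched.
            out.append(text)
            continue
        for ch in text:
            code = ord(ch)
            # Assumption: Word may report symbol characters in the private-use
            # range U+F020..U+F0FF; fold those back to the plain character code.
            if 0xF020 <= code <= 0xF0FF:
                ch = chr(code - 0xF000)
            # Characters missing from the table pass through unchanged.
            out.append(SYMBOL_TO_UNICODE.get(ch, ch))
    return "".join(out)


runs = [
    ("a", "Symbol"),
    (" is the first letter of the Greek alphabet", "Times New Roman"),
]
converted = symbol_to_unicode(runs)
print(ascii(converted))  # ascii() keeps the output printable on any console
```

Keeping the table as a plain dictionary means it can grow toward full
coverage without touching the loop.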

I can't believe that that's the best way to overcome this problem, but
it does have the feel of Redmond about it.

Has anyone here had to deal with characters from the Symbol (or any
other un-American) font in Python?

Incidentally, the goal of all this is to use Word templates under Win98
and Python regexes to produce valid XML while shielding delicate Word
users from the intricacies of XML and XML editors. But I'd like it to be
easy and quick! And I'd rather use Python than VBA, since Python does a
*slightly* better job with XML than VB.
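
Given a proper Unicode string, the XML end of this looks manageable. A
rough sketch using only the standard library; the element name "para"
and the helper name to_xml_element are hypothetical, and emitting
numeric character references is just one option (writing the file as
UTF-8 would be another).

```python
from xml.sax.saxutils import escape


def to_xml_element(tag, text):
    """Wrap text in an element, escaping markup and non-ASCII characters."""
    escaped = escape(text)  # handles &, < and >
    # Replace anything outside ASCII with a numeric character reference,
    # so the output does not depend on the file's declared encoding.
    ascii_bytes = escaped.encode("ascii", "xmlcharrefreplace")
    return "<%s>%s</%s>" % (tag, ascii_bytes.decode("ascii"), tag)


print(to_xml_element("para", "\u03b1 < \u03b2 & so on"))
# prints: <para>&#945; &lt; &#946; &amp; so on</para>
```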

Thanks,
Fred


