Getting Properly Encoded Strings from Word into Python

Martin von Loewis loewis at informatik.hu-berlin.de
Sun Jan 20 10:43:06 EST 2002


"Frederick H. Bartlett" <fbartlet-fixit at optonline.net> writes:

> 1. Word (and VB/VBA) work natively in Unicode.

Mostly, yes. As you found out, they do use an alternative
representation sometimes, especially for Symbol characters.

> 2. Windows 95/98 do not.

Mostly, yes. The Win32 API is fully Unicode. It also offers a
compatibility "ANSI" API. On NT/2k/XP, the Unicode API is the native
one and the ANSI API is emulated; on W9x, it is the other way around. You can
exchange them freely to a degree, using the CP_ACP converter (which is
exposed as "mbcs" in Python).
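
As a rough illustration of the "mbcs" conversion, here is a minimal
sketch. The "mbcs" codec exists only on Windows, so the fallback to
"cp1252" elsewhere is my own assumption (a stand-in for a typical
Western ANSI code page), and the helper names are made up for this
example.

```python
import sys

# Assumption: off Windows there is no "mbcs" codec, so cp1252 stands in
# for the ANSI code page. On Windows, "mbcs" uses the real CP_ACP.
FALLBACK_ANSI_CODEC = "cp1252"


def ansi_codec():
    """Return the codec name to use for ANSI <-> Unicode conversion."""
    return "mbcs" if sys.platform == "win32" else FALLBACK_ANSI_CODEC


def ansi_to_unicode(data, codec=None):
    """Decode ANSI bytes into a Unicode string."""
    return data.decode(codec or ansi_codec())


def unicode_to_ansi(text, codec=None):
    """Encode a Unicode string as ANSI bytes.

    Characters the code page cannot represent are replaced, which is
    why the exchange only works "to a degree".
    """
    return text.encode(codec or ansi_codec(), "replace")
```

Anything outside the ANSI code page is lost on the way back, so the
round trip is only safe for text that the code page can represent.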

> 3. Windows API functions that are passed strings come in two
>     versions, one for ANSI and one for Unicode.
> 4. The Unicode versions don't work under 95/98.

Some do. Microsoft has a list of functions that work; that list is
longer in W98 than it was for W95.

> 5. Word does not enforce Unicode, ANSI, or anything else.
> 6. Word passes strings to VBA functions (as .Text) without any
> formatting.

You can always find out the formatting, though.

> 7. Word formatting is accomplished via a pointer table.
> 8. There is no direct access to that table.

Not sure what you want to say here. What table is that?

> Has anyone here had to deal with characters from the Symbol (or any
> other un-American) font in Python?

Unamerican fonts alone are no problem; Microsoft will use Unicode in
Word for all "text" data. The issue is specifically with decorative
fonts, such as Symbol or Wingdings.

For those characters, Microsoft Word uses the Unicode Private Use Area
(PUA), starting with U+F000. So the Symbol character 0x61, which ought
to be represented as U+03B1 (GREEK SMALL LETTER ALPHA), really is
represented as U+F061.
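
The offset arithmetic can be sketched as follows. This is only an
illustration: the function names are mine, and the assumption that the
block spans U+F000 through U+F0FF (one slot per 8-bit font position)
is inferred from the description above, not taken from any Microsoft
documentation.

```python
# Assumption: Word places decorative-font characters at
# U+F000 + <8-bit font position>, i.e. in U+F000..U+F0FF.
PUA_BASE = 0xF000


def is_word_symbol_pua(ch):
    """True if ch lies in the PUA block Word uses for decorative fonts."""
    return PUA_BASE <= ord(ch) <= PUA_BASE + 0xFF


def pua_to_font_byte(ch):
    """Recover the original 8-bit font position from a PUA character."""
    if not is_word_symbol_pua(ch):
        raise ValueError("not in U+F000..U+F0FF: %r" % ch)
    return ord(ch) - PUA_BASE
```

For the example in the text, pua_to_font_byte("\uf061") gives 0x61,
the position of alpha in the Symbol font.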

As if using the private use area were not enough, Word.Application
apparently returns nonsense when asked for the .Text property of a
Range containing characters from the PUA: it should return their
numeric values, but does something different (not sure what exactly it
does).

Now, even if it did return the numeric values from the PUA, you
still couldn't tell what character it is: U+F061, if interpreted as
Wingdings, would identify U+264B, CANCER.
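
To make the ambiguity concrete, here is a small sketch in which the
same PUA code point resolves to different real characters depending on
the font. The tables hold only the two mappings mentioned in this
message; a usable version would need complete tables for each font,
and the names are hypothetical.

```python
# Deliberately incomplete: only the two mappings named in this message.
FONT_TABLES = {
    "Symbol": {0x61: "\u03b1"},     # GREEK SMALL LETTER ALPHA
    "Wingdings": {0x61: "\u264b"},  # CANCER
}


def resolve_pua(ch, font):
    """Map a Word PUA character to a real Unicode character.

    Returns None if the font or the position is not in the tables.
    """
    position = ord(ch) - 0xF000  # undo the PUA offset
    return FONT_TABLES.get(font, {}).get(position)
```

Without the font name, resolve_pua has nothing to go on, which is why
the code point alone is not sufficient.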

It appears that the *only* way to reliably find out whether a character
is in such a decorative font is to use Dialogs(wdDialogInsertSymbol).
If this dialog is opened and the selection has a single character, the
dialog will have this character preselected. You can then use the
dialog's .Font property to find out whether it is "Symbol",
"Wingdings", or "(normal Text)", and you use the dialog's .charnum
property to find out the numeric value of the character. This will
normally return a negative number; add 2**16 to get the Unicode
numeric value.
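
The final conversion step might look like the sketch below. It covers
only the arithmetic on values you have already read from the dialog;
fetching .Font and .charnum via COM is left out, and the function
names are made up for this example.

```python
def charnum_to_codepoint(charnum):
    """Turn the dialog's character number into a Unicode code point.

    The value normally arrives as a negative (signed 16-bit) number;
    adding 2**16 yields the unsigned code point. Non-negative values
    are passed through unchanged.
    """
    return charnum + 2**16 if charnum < 0 else charnum


def describe_symbol(font_name, charnum):
    """Return (font name, code point, is_decorative) for one character.

    Assumption: a character is decorative when its code point falls in
    U+F000..U+F0FF, the PUA range discussed above.
    """
    codepoint = charnum_to_codepoint(charnum)
    return font_name, codepoint, 0xF000 <= codepoint <= 0xF0FF
```

For instance, a reported value of -3999 becomes 61537, which is
0xF061.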

HTH,
Martin


