[Python-Dev] Tcl and Unicode

Guido van Rossum guido@python.org
Sat, 07 Oct 2000 08:51:12 -0500


> Fix for next iteration of SF bug 115690 (Unicode headaches in IDLE).  The
> parsing functions in support of auto-indent weren't expecting Unicode
> strings, but text.get() can now return them (although it remains muddy as
> to exactly when or why that can happen).  Fixed that with a Big Hammer.

I apologize; I should have explained when text.get() returns Unicode:

Any string returned from Tcl/Tk that contains a byte with the 8th bit
set is translated from UTF-8 into Unicode, unless the translation
fails (in which case the original raw 8-bit string is returned as a
fallback).
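
A rough sketch of that rule in present-day Python 3, for illustration only.
This is not the actual _tkinter code: from_tcl is a made-up name, and a
bytes object stands in for the raw 8-bit string.

```python
def from_tcl(raw: bytes):
    """Hypothetical helper mimicking the conversion rule described above."""
    # No byte has the 8th bit set: plain 7-bit data, returned unchanged.
    if all(b < 0x80 for b in raw):
        return raw
    try:
        # At least one high byte: try to interpret the data as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Translation failed: fall back to the original raw 8-bit string.
        return raw
```

Under these assumptions, from_tcl(b"abc") stays an 8-bit string, the UTF-8
encoding of "é" comes back as a Unicode string, and a lone b"\xe9" (not
valid UTF-8) comes back untouched. So the caller can see either type.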

This *should* be correct because Tcl/Tk always uses UTF-8 internally.
(Even though it is "lenient" when receiving strings -- if a sequence
of characters has no valid Unicode representation, it appears to fall
back to Latin-1; I don't know the details of this algorithm.)

--Guido van Rossum (home page: http://www.python.org/~guido/)