[Python-Dev] Tcl and Unicode

Tim Peters tim_one@email.msn.com
Sat, 7 Oct 2000 14:43:55 -0400


>> Fix for next iteration of SF bug 115690 (Unicode headaches in
>> IDLE). ...

[Guido]
> I apologize, I should have explained when text.get() returns Unicode:
>
> Any string returned from Tcl/Tk that contains a byte with the 8th bit
> set is translated from UTF-8 into Unicode, unless the translation
> fails (in which case the original raw 8-bit string is returned as a
> fallback).

Except that's *why* it was muddy <wink>:  in the specific case that popped
up in the bug, text.get() appeared to return a Unicode string of length 1
containing only a newline.  No high-bit byte appeared to be involved.
However, that was an illusion I didn't unmask until later.  All is clear
now.
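
For concreteness, the rule Guido describes above can be sketched roughly as
follows. This is only an illustration in present-day Python 3 terms, not the
actual _tkinter code; the helper name tcl_result_to_python is made up here,
and a bytes object stands in for the "raw 8-bit string".

```python
def tcl_result_to_python(raw: bytes):
    """Hypothetical helper mirroring the rule described in the thread.

    Assumption: `raw` is the byte string handed back by Tcl/Tk.
    """
    # No byte has the 8th bit set: keep the plain 8-bit string.
    if all(b < 0x80 for b in raw):
        return raw
    try:
        # At least one high-bit byte: try to interpret the data as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Translation failed: fall back to the original raw bytes.
        return raw


plain = tcl_result_to_python(b"\n")                        # stays bytes
accented = tcl_result_to_python("\u00e9".encode("utf-8"))  # becomes str
broken = tcl_result_to_python(b"\xe9")                     # invalid UTF-8, stays bytes
```

Under this rule a lone newline would come back as an 8-bit string, which is
why a Unicode result in that case looked so puzzling at first.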

> This *should* be correct because Tcl/Tk always uses UTF-8 internally.
> (Even though it is "lenient" when receiving strings -- if a sequence
> of characters has no valid Unicode representation, it appears to fall
> back to Latin-1; I don't know the details of this algorithm.)

Dunno, but wouldn't be surprised if they had a notion of default encoding,
and that it simply appears to be Latin-1 to us because American Windows uses
a superset of Latin-1.  If BeOpen would like to buy me a version of Chinese
Windows, happy to lend it to you <wink>.
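
A small check of the "superset" remark, assuming the American Windows code
page in question is cp1252 (an assumption; the message does not name it).
Using Python's standard codecs, the sketch compares how cp1252 and Latin-1
decode the printable high-bit range 0xA0-0xFF, and one byte below that
range where the two are known to differ.

```python
# Bytes 0xA0..0xFF: collect those that decode identically in both codecs.
same = [
    b for b in range(0xA0, 0x100)
    if bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
]

# Byte 0x80 is a C1 control character in Latin-1, but the euro sign
# in Python's cp1252 codec.
differs_at_0x80 = bytes([0x80]).decode("cp1252") != bytes([0x80]).decode("latin-1")
```

So the two agree on every printable Latin-1 character and diverge only in
0x80-0x9F, which fits the guess that a cp1252 default would look like
Latin-1 for ordinary Western European text.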

as-american-as-they-come-ly y'rs  - tim