Python and UTF-8

Thu Jan 3 20:48:41 EST 2002

Matthias Huening <mhuening at zedat.fu-berlin.de> writes:

> Hhmm, but how come that reading a text file with Python and displaying it 
> in a Tkinter text widget (with a Unicode font) will show the text just 
> fine -- regardless of the encoding used to save the file (Latin-1 or UTF-
> 8) and without specifying the encoding when opening it. Does Python guess 
> itself?

No, it is Tk that guesses, using a questionable algorithm whose
details depend on the Tk version. If a Tk widget gets a byte string
(rather than a Unicode string), it first assumes that it is UTF-8.
There is an almost reliable algorithm to tell whether data is UTF-8,
so that is fine.
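
To illustrate why the UTF-8 test is only "almost" reliable, here is a
minimal sketch in current Python 3. It is not Tk's actual code; the
helper name looks_like_utf8 is made up for this example, and it simply
relies on a strict decode failing for byte sequences that are not
well-formed UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "Grüße".encode("utf-8")
latin1_bytes = "Grüße".encode("latin-1")

# Typical Latin-1 text with accented letters is not valid UTF-8 ...
print(looks_like_utf8(utf8_bytes), looks_like_utf8(latin1_bytes))

# ... but some Latin-1 byte sequences happen to be valid UTF-8 as well,
# so the test can be fooled in rare cases.
ambiguous = "Ã©".encode("latin-1")  # the two bytes C3 A9
print(looks_like_utf8(ambiguous), ambiguous.decode("utf-8"))
```

Pure ASCII data passes the test too, which is harmless, since ASCII
decodes identically under UTF-8 and Latin-1.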

If it finds that it is not UTF-8, it falls back. I don't exactly
remember what it falls back to: either Latin-1, or the locale's
encoding. Either fall-back is broken: the data may not be in that
encoding, but there is no way to reliably find out. If you open a
KOI-8 file in a Latin locale, I'm pretty sure the display will be
garbage.
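
The failure mode can be sketched in current Python 3. This only
imitates the "try UTF-8, then fall back" behaviour described above; it
is not Tk's code, the function name decode_with_fallback is
hypothetical, and Latin-1 as the fall-back is an assumption (it might
just as well be the locale's encoding). KOI8-R stands in for "KOI-8".

```python
def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Try UTF-8 first, then a fixed fall-back encoding (assumed Latin-1)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to some character, so this never fails,
        # even when the data was written in a different encoding.
        return data.decode(fallback)


original = "Привет"
koi8_bytes = original.encode("koi8-r")

shown = decode_with_fallback(koi8_bytes)
print(repr(shown))        # Latin-1 characters, not the Cyrillic text
print(shown == original)  # False

# Only a caller who knows the real encoding gets the text back.
print(koi8_bytes.decode("koi8-r") == original)  # True
```

The fall-back decode succeeds without any error, which is exactly the
problem: nothing signals that the result is wrong.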

Regards,
Martin