Tkinter wart: returned texts are sometimes strings, sometimes Unicode strings

Thu Mar 20 15:02:38 EST 2003

Suppose that somebody enters a string containing U+00A1.  Returning any
non-unicode string (for instance '\xa1' (latin-1) or '\xc2\xa1' (utf-8))
is going to be wrong for some applications in some locales.
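
To make the ambiguity concrete, here is a small sketch (variable names are
only illustrative; it is written so that it runs on Python 2 or 3) showing
that the one character has two different byte forms, and that guessing the
wrong encoding on the way back gives either the wrong text or an error:

```python
ch = u"\xa1"  # U+00A1 INVERTED EXCLAMATION MARK

latin1_bytes = ch.encode("latin-1")  # one byte: 0xA1
utf8_bytes = ch.encode("utf-8")      # two bytes: 0xC2 0xA1

# Reading the UTF-8 bytes as Latin-1 "works", but yields two characters.
wrong = utf8_bytes.decode("latin-1")

# Reading the Latin-1 byte as UTF-8 fails outright.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

Whichever byte form Tkinter picked, an application expecting the other
one would be in one of these two situations.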

I assume that returning a plain string when possible is a space
optimization, but I wouldn't be sad to see it go (or become an internal
optimization, like the merger of "machine" and L-suffixed integers,
if this could in fact be done fairly painlessly).

IMO, it's intended that Python code will automatically accept Unicode
strings anywhere regular strings were originally used, except in
interfaces which are explicitly byte-oriented.  In addition, several
facilities exist for common byte-oriented interfaces (file I/O being the
major one) to automatically encode the string into its byte-oriented
representation.
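
As one illustration of such a facility, the codecs module can wrap a
byte-oriented stream so that Unicode strings are encoded on the way out
(codecs.open() does the same for a named file).  This is only a sketch:
the in-memory io.BytesIO stands in for a real file, and the io module
is an assumption about the Python version in use (it needs 2.6 or later).

```python
import codecs
import io

raw = io.BytesIO()                       # stand-in for a byte-oriented file
writer = codecs.getwriter("utf-8")(raw)  # encodes text as it is written
writer.write(u"\xa1")                    # caller passes a Unicode string

encoded = raw.getvalue()                 # the bytes that reached the "file"
```

The calling code never mentions bytes; the choice of encoding is made
once, at the point where the stream is wrapped.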

However, there is one thing I might be in favor of.  If you're working
in one of those rare environments where using sys.setdefaultencoding()
makes sense, then *maybe* the following sequence should give you a
plain string rather than a unicode one:
    $ python -S
    Python 2.2.2 (#1, Oct 24 2002, 10:50:06) 
    [GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-110)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.setdefaultencoding("utf-8")
    >>> import site
    >>> import Tkinter
    >>> t = Tkinter.Entry()
    >>> t.pack()
    >>> t.insert(0, u"\xa1")
    >>> t.get()  # should possibly be '\xc2\xa1' instead
    u'\xa1'

By the way, in this statement
    > if type(text) == type(unicode('')): text = text.encode(...)
type(unicode('')) is just unicode, so you should probably write this instead:
    if isinstance(text, unicode): text = text.encode(...)
Of course, it's harmless (but extra work) to .encode() a plain ASCII string:
    >>> "abcd".encode("utf-8")
    'abcd'
and if you've used sys.setdefaultencoding(), a plain str() can work:
    >>> str(t.get())
    '\xc2\xa1'
Either of these techniques means that you can derive versions of any
widget type, overriding the method that returns the unicode strings you
don't want.  Or you could subclass only Tk, wrapping self.tk with
something that returns encoded strings from the necessary methods.
Subclasses just copy the value of the parent's tk attribute, so there'd
be no need to change every method that might return a string, just the
methods on one object.
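
A rough sketch of that wrapping idea follows.  Everything named here
(encode_result, EncodingProxy, FakeTk) is hypothetical, and FakeTk is
only a stand-in so that the example runs without a display.  Whether the
real tk object can be swapped out this way, and which of its methods
would need wrapping, is an assumption I have not tested against Tkinter.

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3


def encode_result(value, encoding="utf-8"):
    """Encode Unicode results; pass anything else through unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


class EncodingProxy(object):
    """Hypothetical wrapper: forwards attribute access to `target` and
    encodes Unicode strings returned by the methods listed in `names`."""

    def __init__(self, target, names, encoding="utf-8"):
        self._target = target
        self._names = frozenset(names)
        self._encoding = encoding

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself.
        attr = getattr(self._target, name)
        if name not in self._names:
            return attr

        def wrapped(*args, **kwargs):
            return encode_result(attr(*args, **kwargs), self._encoding)
        return wrapped


class FakeTk(object):
    """Stand-in for the object stored in a widget's tk attribute."""

    def call(self, *args):
        return u"\xa1"

    def other(self):
        return u"\xa1"


proxy = EncodingProxy(FakeTk(), ["call"])
```

Here proxy.call(...) returns the UTF-8 encoded form, while proxy.other()
still returns the Unicode string, because only the listed method names
are wrapped.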

Jeff