unicode encoding usability problem

Thomas Heller theller at python.net
Fri Feb 18 15:18:17 EST 2005


Walter Dörwald <walter at livinglogic.de> writes:

> aurora wrote:
>
>  > [...]
>> In Java they are distinct data types and the compiler would catch all
>> incorrect usage. In Python, the interpreter seems to 'help' us by
>> promoting binary strings to unicode. Things work fine, unit tests
>> pass, all until the first non-ASCII characters come in and then the
>> program breaks.
>> Is there a scheme for Python developers to use so that they are safe
>> from incorrect mixing?
>
> Put the following:
>
> import sys
> sys.setdefaultencoding("undefined")
>
> in a file named sitecustomize.py somewhere in your Python path and
> Python will complain whenever there's an implicit conversion between
> str and unicode.

Sounds cool, so I did it.
And started a program I was currently working on.
The first function in it is this:

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + r"\bin"

    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc

All strings in that snippet are text strings, so the first approach was
to convert them to unicode literals.  Doesn't work.  Here is the final,
working version (changes are marked):

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + ur"\bin"
#-----------------------------------------------------------------^
    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc.encode("mbcs")
#--------------------------------^

So, it appears that:

- the _winreg.QueryValueEx function is strange: it takes ascii strings,
  but returns a unicode string.
- _winreg.OpenKey takes ascii strings.
- os.environ["PATH"] accepts an ascii string.
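
The two marked changes amount to converting explicitly at each API
boundary instead of relying on the default encoding. A minimal sketch of
that pattern follows. The helper names to_text and to_bytes are
hypothetical (they are not part of _winreg or os), and "cp1252" is an
assumption standing in for "mbcs", which only exists on Windows.

```python
# Hypothetical helpers, not part of _winreg or os. They only illustrate
# converting explicitly at each boundary instead of relying on the
# interpreter's default encoding.
text_type = type(u"")    # unicode on Python 2, str on Python 3
bytes_type = type(b"")   # str on Python 2, bytes on Python 3


def to_text(value, encoding):
    """Return value as a text string, decoding byte strings explicitly."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding):
    """Return value as a byte string, encoding text strings explicitly."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


# "cp1252" is an assumption standing in for "mbcs" (Windows only).
# The byte string below plays the role of a value read from some API.
loc = to_text(b"C:\\gccxml", "cp1252") + u"\\bin"
path_bytes = to_bytes(loc, "cp1252")
```

With helpers like these, every str/unicode crossing is visible in the
source, so a program run with the "undefined" default encoding should not
trip over a hidden conversion at those points.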

And I won't guess what happens when there are registry entries with
umlauts (ok, they could be handled by 'mbcs' encoding), and with Chinese
or Japanese characters (no way to represent them in ascii strings with a
western locale and mbcs encoding, afaik).
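
The limitation described above can be reproduced without a Windows
machine. This sketch uses "cp1252" as an assumed stand-in for a western
"mbcs" code page; the real "mbcs" codec may behave differently (for
instance, it may substitute a replacement character rather than raise).

```python
# "cp1252" is an assumption: a western code page standing in for "mbcs".
WESTERN = "cp1252"

umlaut = u"\u00f6"     # o with diaeresis
cjk = u"\u4e2d"        # a Chinese character

# The umlaut has a slot in the western code page, so encoding succeeds.
umlaut_bytes = umlaut.encode(WESTERN)

# The Chinese character has no slot, so strict encoding raises.
try:
    cjk.encode(WESTERN)
    cjk_ok = True
except UnicodeEncodeError:
    cjk_ok = False

# With the "replace" error handler the character is silently lost.
cjk_lossy = cjk.encode(WESTERN, "replace")
```

Either outcome, an exception or a silently replaced character, means such
a registry value cannot round-trip through a byte string in that code
page.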


I suggest that 'sys.setdefaultencoding("undefined")' be the standard
setting for the core developers ;-)

Thomas


