[I18n-sig] New Unicode default encoding scheme

Guido van Rossum guido@beopen.com
Mon, 10 Jul 2000 09:25:05 -0500


> On Fri, 09 Jun 2000 13:09:19 +0200, "M.-A. Lemburg" <mal@lemburg.com>
> wrote:
> 
> >For this, the implementation maintains a global which can be set in
> >the site.py Python startup script. Subsequent changes are not
> >possible. The <default encoding> can be set and queried using the
> >two sys module APIs:
> 
> I'm confused about the justification for this restriction. I can see
> that frequent arbitrary changes would be bad style, but is there any
> reason stronger than that?
> 
> For Zope, the right place to set the default encoding is in __main__,
> which doesn't seem unreasonable.
> 
> 
> At the moment the restriction is enforced with a
> 'del sys.setdefaultencoding' near the end of site.py. This means the
> restriction can be bypassed with a 'reload(sys)'. Am I going to regret
> doing that?

Yes, when it is dropped from the sys module altogether.  Remember that
it's an experimental feature!  The current discussion of this issue
makes that a likely outcome.  The default encoding may well become
fixed to ASCII for all practical purposes.

One particularly nasty issue is that allowing the default encoding to
change may affect dictionary lookups in a bad way.  I'll try to
explain the issue here.

First, Python uses the rule that if two objects a and b compare equal
using ==, they can be used interchangeably as dictionary keys, even if
they have different types.  So, if d is {0:'yo'}, then d[0], d[0L],
d[0.0], and d[0j] all succeed returning 'yo'.  Similarly, if d is
{'a':'ho'}, then d['a'] and d[u'a'] both return the same thing.
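
A minimal sketch of the numeric case above.  It leaves out d[0L]
because the long-integer literal does not exist in later Python
versions; that omission and the variable names are choices made for
this illustration, not part of the original message.

```python
# Keys that compare equal are interchangeable, even across types.
d = {0: 'yo'}

# int, float and complex zero all compare equal to the key 0 ...
results = [d[0], d[0.0], d[0j]]

# ... and they must also hash alike for the lookups above to succeed.
same_hash = hash(0) == hash(0.0) == hash(0j)

print(results, same_hash)
```

The lookups only work because both conditions hold at once: the keys
compare equal and their hash values agree.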

Now consider d = {'\200': 'bo'}.  If the encoding is variable, the
Unicode character that is equal to '\200' is also variable.  So, at
one point in the program, where the default encoding is Latin-1,
d[u'\200'] might work; but the lookup might fail in another part,
where the default encoding maps '\200' to something else (or where
'\200' is not even a valid encoded string, e.g. when the encoding is
UTF-8).
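
To make this concrete, here is a small sketch that decodes the single
byte '\200' (0x80) under several encodings.  It uses explicit bytes
objects and the decode() method rather than a process-wide default
encoding, and the helper decode_or_none and the choice of cp1252 as
the "something else" encoding are assumptions made for illustration.

```python
raw = b'\200'  # one byte with value 0x80


def decode_or_none(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_latin1 = decode_or_none(raw, 'latin-1')  # U+0080, same code point
as_cp1252 = decode_or_none(raw, 'cp1252')   # a different character
as_utf8 = decode_or_none(raw, 'utf-8')      # a lone 0x80 is invalid
as_ascii = decode_or_none(raw, 'ascii')     # outside the 7-bit range

print(repr(as_latin1), repr(as_cp1252), as_utf8, as_ascii)
```

The same byte therefore equals a different Unicode character, or no
character at all, depending on which encoding is in effect.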

This by itself is not a showstopper.  However if we look into the
implementation of dictionaries, we see that the hash() function is
used to make lookups in the internal hash table fast.  The use of the
hash() function by the dictionary type requires that if two values
compare equal, they *must* have the same hash() value.  Otherwise, we
can run into the situation where a value is equal to one of the keys
of the dictionary, but it isn't found when used in a lookup, because
its hash is different!  (The same restriction is the reason why
mutable types like lists cannot be used as dictionary keys.)  The
speed of the dictionary implementation is fundamental for the speed of
Python, and changing its implementation to rely less on hash values
would mean an unacceptable degradation in performance.  (And yes,
failing lookups must also be fast!)
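
The failure mode can be reproduced with a deliberately broken key
type.  BadKey below is a hypothetical class invented for this sketch:
two instances compare equal when their text matches, but each hashes
to its own salt, which breaks the rule that equal values must have
equal hashes.

```python
class BadKey:
    """Hypothetical key type: equality ignores salt, hashing uses it."""

    def __init__(self, text, salt):
        self.text = text
        self.salt = salt

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.text == other.text

    def __hash__(self):
        # Broken on purpose: equal objects can return different hashes.
        return self.salt


k1 = BadKey('a', 1)
k2 = BadKey('a', 2)
d = {k1: 'bo'}

keys_equal = (k1 == k2)  # the two keys compare equal ...
found = (k2 in d)        # ... but the lookup with k2 misses

print(keys_equal, found)
```

The dictionary compares hash values before it ever calls ==, so the
equal key is never examined.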

Making the default encoding variable causes insurmountable problems
for the hash() function.  A fixed encoding (whether UTF-8, ASCII or
Latin-1) means that we can code hash() so that an 8-bit string and a
Unicode string that compare equal always have the same hash value.
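
A toy model of that idea, not the interpreter's actual implementation:
unified_hash and FIXED_ENCODING are hypothetical names, and the sketch
simply routes 8-bit strings through one encoding that never changes
before hashing them.

```python
FIXED_ENCODING = 'ascii'  # assumed constant for the life of the process


def unified_hash(value):
    """Hash 8-bit and Unicode strings through one fixed encoding."""
    if isinstance(value, bytes):
        value = value.decode(FIXED_ENCODING)
    return hash(value)


# Equal under the fixed encoding, so the hashes agree as well.
hashes_match = unified_hash(b'a') == unified_hash('a')
print(hashes_match)
```

If FIXED_ENCODING could change while a dictionary was alive, a key
stored under one encoding could hash differently under the next, and
the guarantee above would be lost.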


Toby, would it be a problem for Zope if the system's default encoding
were ASCII?  You can still introduce the concept of a Zope default
encoding, to be applied explicitly by all Zope code whenever you need
it.  This is what we are trying to get applications to do anyway:
don't rely on the default encoding, always be explicit about the
encoding.  The ASCII default is intended to avoid having to worry
about encodings when using ASCII string literals in code that
manipulates strings and would otherwise work fine with either 8-bit or
Unicode strings.
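
One way an application might apply its own default explicitly is
sketched below.  ZOPE_ENCODING, to_text and to_bytes are hypothetical
names chosen for this illustration, and Latin-1 is only a placeholder
value; none of them is an actual Zope API.

```python
ZOPE_ENCODING = 'latin-1'  # hypothetical application-wide setting


def to_text(data, encoding=ZOPE_ENCODING):
    """Decode 8-bit data explicitly; pass Unicode text through."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def to_bytes(text, encoding=ZOPE_ENCODING):
    """Encode Unicode text explicitly; pass 8-bit data through."""
    if isinstance(text, str):
        return text.encode(encoding)
    return text


# Conversions happen at the boundary, with the encoding named in code.
title = to_text(b'caf\xe9')
wire = to_bytes(title)
print(repr(title), repr(wire))
```

Code written this way keeps working whatever the interpreter's own
default turns out to be, because it never consults that default.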

--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)