[Python-Dev] Encodings

Sat, 08 Jul 2000 17:06:06 -0500

> Guido van Rossum wrote:
> > I couldn't have said it better.  It's okay for now to have it
> > changeable at the C level -- with endless caveats that it should be
> > set only once before any use, and marked as an experimental feature.
> > But the Python access and the reliance on the environment should go.

[MAL replies]
> Sorry, but I'm really surprised now: I've put many hours of
> work into this, hacked up encoding support for locale.py,
> went through endless discussions, proposed the changeable default
> as a compromise to make all parties (ASCII, UTF-8 and Latin-1) happy
> ... and now all it takes is one single posting to render all that
> work useless ???

I'm sorry too.  As Fred Drake explained, the changeable default was an
experiment.  I won't repeat his excellent response.

I am perhaps to blame for the idea that the character set of 8-bit
strings in C can be derived in some way from the locale -- but the
main reason I brought it up was as a counter-argument to the Latin-1
fixed default that effbot argued for.  I never dreamed that you could
actually find out the name of the character set given the locale!

> Instead of tossing things we should be *constructive* and come
> up with a solution to the hash value problem, e.g. I would
> like to make the hash value be calculated from the UTF-16
> value in a way that is compatible with ASCII strings.

I think you are proposing to drop the following rule:

  if a == b then hash(a) == hash(b)

or also

  if hash(a) != hash(b) then a != b

This is very fundamental for dictionaries!  Note that it is currently
broken:

  >>> d = {'\200':1}
  >>> d['\200']
  1
  >>> u'\200' == '\200'
  1
  >>> d[u'\200']
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  KeyError: €
  >>> 
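
To illustrate why the rule matters for dictionaries, here is a minimal
sketch in present-day Python 3 (not the interpreter used in the
session above).  BrokenKey is a hypothetical class invented for this
illustration: it compares equal on its text but hashes on an unrelated
salt, mimicking an 8-bit string and a Unicode string that compare
equal yet hash differently.

```python
class BrokenKey:
    """Hypothetical key type that violates: a == b implies hash(a) == hash(b)."""

    def __init__(self, text, salt):
        self.text = text
        self.salt = salt

    def __eq__(self, other):
        # Equality looks only at the text.
        return isinstance(other, BrokenKey) and self.text == other.text

    def __hash__(self):
        # The hash looks only at the salt, so equal keys can hash differently.
        return self.salt


a = BrokenKey("\x80", 1)
b = BrokenKey("\x80", 2)
d = {a: 1}

equal = (a == b)   # the two keys compare equal
found = (b in d)   # but the dictionary cannot find b, because its hash differs
```

The dictionary compares hashes before it compares keys, so an equal
key with a different hash is never even tested for equality.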

While you could fix this with a variable encoding, it would be very
hard, probably involving converting the string to Unicode before
taking its hash, and this would slow down the hash calculation for
8-bit strings considerably (and these are fundamental for the speed of
the language!).

So I am for restoring ASCII as the one and only fixed encoding.  (Then
you can fix your hash much easier!)
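
A toy sketch, in Python 3, of why a fixed ASCII default makes the hash
fix easier.  ordinal_hash is a made-up hash for illustration only, not
CPython's actual string hash, and hash_bytes and hash_text are
likewise hypothetical names.  The assumption is that both string types
are hashed over their numeric units: byte values for 8-bit strings,
code points for Unicode strings.

```python
def ordinal_hash(units):
    """Toy hash over a sequence of integers (not CPython's real algorithm)."""
    h = 0
    for u in units:
        h = ((h * 1000003) ^ u) & 0xFFFFFFFF
    return h


def hash_bytes(data):
    # Iterating over bytes yields integers in range(256); no decoding needed.
    return ordinal_hash(data)


def hash_text(text):
    # Hash a Unicode string over its code points.
    return ordinal_hash(ord(ch) for ch in text)


# ASCII data: byte values equal code points, so the hashes agree for free.
ascii_same = hash_bytes(b"spam") == hash_text("spam")

# Non-ASCII data under another encoding (UTF-8 here): the bytes differ
# from the code points, so the raw byte hash no longer matches.
utf8_same = hash_bytes("\u20ac".encode("utf-8")) == hash_text("\u20ac")
```

With ASCII-only data the 8-bit hash needs no conversion step.  Under
any other default encoding the bytes would have to be decoded first to
get a matching hash, which is the slowdown described above.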

Side note: the KeyError handling is broken.  The bad key should be run
through repr() (probably when the error is raised rather than when it
is displayed).
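
A small sketch of the repr() idea, in Python 3.  The helper
describe_missing_key is hypothetical and only shows the formatting
step; it is not how the interpreter builds its own KeyError message.

```python
def describe_missing_key(key):
    """Hypothetical helper: format the bad key with repr() at raise time."""
    # %r applies repr(), so non-printable characters appear as escapes
    # instead of being written raw to the terminal.
    return "missing key: %r" % (key,)


message = describe_missing_key("\x80")
```

Because the message is built from repr(), it stays readable no matter
what character set the terminal happens to use.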

--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)