[Python-Dev] Encodings
Guido van Rossum
guido@beopen.com
Sat, 08 Jul 2000 17:06:06 -0500
> Guido van Rossum wrote:
> > I couldn't have said it better. It's okay for now to have it
> > changeable at the C level -- with endless caveats that it should be
> > set only once before any use, and marked as an experimental feature.
> > But the Python access and the reliance on the environment should go.
[MAL replies]
> Sorry, but I'm really surprised now: I've put many hours of
> work into this, hacked up encoding support for locale.py,
> went through endless discussions, proposed the changeable default
> as compromise to make all parties (ASCII, UTF-8 and Latin-1) happy
> ... and now all it takes is one single posting to render all that
> work useless ???
I'm sorry too. As Fred Drake explained, the changeable default was an
experiment. I won't repeat his excellent response.
I am perhaps to blame for the idea that the character set of 8-bit
strings in C can be derived in some way from the locale -- but the
main reason I brought it up was as a counter-argument to the Latin-1
fixed default that effbot argued for. I never dreamed that you could
actually find out the name of the character set given the locale!
> Instead of tossing things we should be *constructive* and come
> up with a solution to the hash value problem, e.g. I would
> like to make the hash value be calculated from the UTF-16
> value in a way that is compatible with ASCII strings.
I think you are proposing to drop the following rule:
if a == b then hash(a) == hash(b)
or also
if hash(a) != hash(b) then a != b
This is very fundamental for dictionaries! Note that it is currently
broken:
>>> d = {'\200':1}
>>> d['\200']
1
>>> u'\200' == '\200'
1
>>> d[u'\200']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: €
>>>
While you could fix this with a variable encoding, it would be very
hard, probably involving converting the string to Unicode before
taking its hash, and this would slow down the hash calculation for
8-bit strings considerably (and these are fundamental for the speed
of the language!).
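To make the rule concrete, here is a minimal sketch in present-day
Python (not the interpreter discussed in this thread). BrokenKey and
FixedKey are hypothetical classes invented for the illustration; they
only stand in for "a second string type that compares equal to an
8-bit string". The point is that a dictionary consults the hash
first, so two equal keys with different hashes never find each other.

```python
class BrokenKey:
    """Hypothetical key: compares equal to a str but hashes differently."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if isinstance(other, BrokenKey):
            return self.text == other.text
        if isinstance(other, str):
            return self.text == other
        return NotImplemented

    def __hash__(self):
        # Deliberately break the rule: always return a value that
        # differs from hash(self.text).
        return 0 if hash(self.text) != 0 else 1


class FixedKey(BrokenKey):
    """Same equality, but the hash agrees with the equal str."""

    def __hash__(self):
        # Restore the rule: a == b implies hash(a) == hash(b).
        return hash(self.text)


d = {"\x80": 1}

broken = BrokenKey("\x80")
fixed = FixedKey("\x80")

print(broken == "\x80")  # True: the keys compare equal
print(broken in d)       # False: hashes differ, so the lookup misses
print(fixed in d)        # True: equal keys, equal hashes
print(d[fixed])          # 1
```

The fixed variant works only because its hash is computed exactly the
way the plain string's hash is, which is the constraint any Unicode
hash would have to meet for every string that compares equal to an
8-bit string.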
So I am for restoring ASCII as the one and only fixed encoding. (Then
you can fix your hash much easier!)
Side note: the KeyError handling is broken. The bad key should be run
through repr() (probably when the error is raised rather than when it
is displayed).
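For comparison, a small sketch of how this looks in present-day
CPython, where the message of a KeyError is the repr() of the missing
key. This shows current behaviour only; it says nothing about how or
when the interpreter under discussion was changed.

```python
d = {}

try:
    d["\x80"]
except KeyError as exc:
    # With a single argument, str() of a KeyError gives repr() of the key,
    # so the unprintable character appears as an escape sequence.
    message = str(exc)

print(message)                  # '\x80'
print(message == repr("\x80"))  # True
```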
--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)