[Python-Dev] Re: \ud800 crashes interpreter (PR#384)

M.-A. Lemburg mal@lemburg.com
Wed, 05 Jul 2000 10:34:27 +0200


Ka-Ping Yee wrote:
> 
> On Tue, 4 Jul 2000, M.-A. Lemburg wrote:
> >
> > The reasoning at the time was that dictionaries should accept
> > Unicode objects as keys which match their string equivalents
> > as the same key, e.g. 'abc' works just as well as u'abc'.
> [...]
> > According to the docs, objects comparing equal should have the
> > same hash value, yet this would require the hash value to be
> > calculated using the default encoding and that
> > would not only cause huge performance problems, but could
> > effectively render Unicode useless,
> 
> Given the new 7-bit-ASCII-as-default-encoding-for-8-bit-strings
> convention, shouldn't just hashing the character values work
> fine?  That is, hash('abc') should == hash(u'abc'), no conversion
> required.

Yes, and it already does so for pure ASCII values. The problem
is that the default encoding can be changed to a locale-specific
value (site.py does the lookup for you), e.g. if you have set
LANG to en_US, Python will use Latin-1 as the default encoding.

This results in 'äöü' == u'äöü', but hash('äöü') != hash(u'äöü'),
which conflicts with the general rule that objects comparing
equal must have the same hash value.
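
To make the mismatch concrete, here is a minimal sketch in present-day
Python. It is an illustration only, not the interpreter's actual code:
ByteString and UnicodeString are hypothetical stand-ins for the 8-bit
string and Unicode types, DEFAULT_ENCODING is assumed to be Latin-1,
and the Unicode hash is assumed to be taken over the UTF-8 form.

```python
# Assumption: the locale-derived default encoding is Latin-1.
DEFAULT_ENCODING = "latin-1"


class ByteString:
    """Hypothetical stand-in for an 8-bit string; hashes its raw bytes."""

    def __init__(self, data):
        self.data = data

    def hash_input(self):
        return self.data

    def __eq__(self, other):
        if isinstance(other, UnicodeString):
            # Comparison coerces through the default encoding.
            return self.data.decode(DEFAULT_ENCODING) == other.text
        if isinstance(other, ByteString):
            return self.data == other.data
        return NotImplemented

    def __hash__(self):
        return hash(self.hash_input())


class UnicodeString:
    """Hypothetical stand-in for a Unicode string; hashes its UTF-8 form."""

    def __init__(self, text):
        self.text = text

    def hash_input(self):
        # Assumption: the hash is computed over the UTF-8 encoded form.
        return self.text.encode("utf-8")

    def __eq__(self, other):
        if isinstance(other, ByteString):
            return other.data.decode(DEFAULT_ENCODING) == self.text
        if isinstance(other, UnicodeString):
            return self.text == other.text
        return NotImplemented

    def __hash__(self):
        return hash(self.hash_input())


# Pure ASCII: equal, and both hash the same bytes.
ascii_b = ByteString(b"abc")
ascii_u = UnicodeString("abc")

# Non-ASCII ("\xe4\xf6\xfc" is a-umlaut, o-umlaut, u-umlaut): equal
# under Latin-1, but the bytes fed to hash() differ.
latin_u = UnicodeString("\xe4\xf6\xfc")
latin_b = ByteString(latin_u.text.encode(DEFAULT_ENCODING))

# A dictionary keyed by the Unicode object will, barring a hash
# collision, fail to find the equal 8-bit key.
table = {latin_u: "value"}
found = latin_b in table
```

In this sketch the ASCII pair behaves as one dictionary key, while the
Latin-1 pair compares equal yet is hashed from different byte
sequences, which is exactly the situation the equal-implies-same-hash
rule is meant to exclude.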

Now, I could go and change the internal cache buffer to hold the
default-encoded form of the string instead of the UTF-8 form, but
this would affect not only hash(), but also the 's' and 't'
argument parser markers, etc.

... I wonder why compiling "print u'\uD800'" causes the
hash value to be computed ...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/