[Python-Dev] RE: \ud800 crashes interpreter (PR#384)

Bill Tutt billtut@microsoft.com
Tue, 4 Jul 2000 15:39:26 -0700


> MAL wrote:
>> Bill wrote:
>> u'\ud800' causes the interpreter to crash
>> example:
>> print u'\ud800'
>> What happens:
>> The code fails to compile because while adding the constant, the
>> unicode_hash function is called, which for some reason requires the
>> UTF-8 string format.

> The reasoning at the time was that dictionaries should accept
> Unicode objects as keys which match their string equivalents
> as the same key, e.g. 'abc' works just as well as u'abc'.

> UTF-8 was the default encoding back then. I'm not sure how
> to fix the hash value given the new strategy w.r.t. the
> default encoding...

> According to the docs, objects comparing equal should have the
> same hash value, yet this would require the hash value to be
> calculated using the default encoding and that
> would not only cause huge performance problems, but could
> effectively render Unicode useless, because not all default
> encodings are lossless (ok, one could work around this by
> falling back to some other way of calculating the hash
> value in case the conversion fails).
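
The fallback MAL mentions in parentheses could look roughly like the sketch below. It is written for modern Python 3, where str and bytes never compare equal, so it only illustrates the shape of the workaround and is not the interpreter's actual unicode_hash. The names fallback_hash and default_encoding are hypothetical.

```python
def fallback_hash(text, default_encoding="ascii"):
    """Hash text via its encoded bytes, with a fallback for lossy encodings.

    Hypothetical helper: not part of CPython, just a sketch of the idea.
    """
    try:
        encoded = text.encode(default_encoding)
    except UnicodeEncodeError:
        # The default encoding cannot represent this text, so no byte string
        # in that encoding could compare equal to it. Hash the text itself.
        return hash(text)
    # Hashing the encoded form keeps the hash equal to that of the
    # matching byte string, which is what equal dictionary keys require.
    return hash(encoded)


ascii_hash = fallback_hash("abc")        # same as hash(b"abc")
euro_hash = fallback_hash("\u20ac")      # falls back to hash("\u20ac")
```

The cost MAL worries about is visible here: every hash computation pays for an encode call, and the result depends on whatever the default encoding happens to be.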
 
Yeah, yeah, yeah. I know all that, just never liked it. :)
The current problem is that the UTF-8 codec can't round-trip surrogate
characters at the moment.
This is easy to fix, so I'm doing a patch to fix this oversight, unless you
beat me to it.
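
For reference, here is a small sketch of what round-tripping a lone surrogate through UTF-8 means. It uses modern Python 3, where the strict UTF-8 codec rejects lone surrogates and the "surrogatepass" error handler allows them. That is today's behaviour, not the patch discussed in this thread, and the variable names are only illustrative.

```python
lone = "\ud800"  # a lone high surrogate, as in the bug report

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the "surrogatepass" error handler the code point is written as an
# ordinary three-byte sequence and can be decoded back unchanged.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")
round_trips = decoded == lone
```

U+D800 fits the three-byte UTF-8 pattern, so the encoded form is the bytes ED A0 80.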

Anything else is slightly more annoying to fix.

Bill