[Python-3000] How should the hash digest of a Unicode string be computed?

Jim Jewett jimjjewett at gmail.com
Mon Aug 27 19:59:40 CEST 2007


On 8/26/07, Guido van Rossum <guido at python.org> wrote:
> But I'm wondering if passing a Unicode string to the various hash
> digest functions should work at all! Hashes are defined on sequences
> of bytes, and IMO we should insist on the user to pass us bytes, and
> not second-guess what to do with Unicode.

Conceptually, unicode *by itself* can't be represented as a buffer.

What can be represented is a unicode string + an encoding.  The
question is whether the hash function needs to know the encoding to
figure out the hash.
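
To make that concrete, here is a minimal sketch (plain hashlib, which
operates on bytes): the same text hashed under two different encodings
gives two different digests, so no digest can be computed without first
choosing an encoding.

    import hashlib

    text = "na\u00efve"   # 'naïve'

    # Hash the UTF-8 bytes and the Latin-1 bytes of the same text.
    utf8_digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    latin1_digest = hashlib.md5(text.encode("latin-1")).hexdigest()

    print(utf8_digest)
    print(latin1_digest)   # a different value: the encoding changed the bytes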

If you're hashing arbitrary bytes, then it doesn't really matter --
there is no expectation that a recoding should have the same hash.

For hashing used as a shortcut for __ne__ (if two hashes differ, the
strings cannot be equal), it does matter for text.

Unfortunately, for historical reasons, plenty of code grabs the string
buffer expecting text.

For dict comparisons, we really ought to specify equality (and
therefore the hash) in terms of a canonical equivalent, encoded in
some encoding X.  (It isn't clear to me that X should be UTF-8 in
particular; the main thing is to pick something.)

The alternative is that defensive code will need to do a (normally
useless boilerplate) decode/canonicalize/reencode dance before
dictionary checks and insertions.
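
Something like the following is what I mean by the dance (canonical_key
is just an illustrative name; NFC and UTF-8 stand in for whatever X we
end up picking):

    import unicodedata

    def canonical_key(s, encoding="utf-8"):
        # decode (if handed bytes), canonicalize, re-encode
        if isinstance(s, bytes):
            s = s.decode(encoding)
        return unicodedata.normalize("NFC", s).encode(encoding)

    d = {}
    d[canonical_key("e\u0301")] = "value"
    print(d[canonical_key("\u00e9")])   # found: both keys canonicalize to the same bytes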

I would rather see that boilerplate done once in the unicode type (and
again in any equivalent types, if need be), because
   (1)  most storage types/encodings would be able to take shortcuts, and
   (2)  if people don't do the defensive coding, the bugs will be very
        obscure.

-jJ

