[Python-3000] How should the hash digest of a Unicode string be computed?

Guido van Rossum guido at python.org
Sun Aug 26 21:24:51 CEST 2007


Change r57490 by Gregory P. Smith broke a test in test_unicodedata
and, on PPC OS X, several tests in test_hashlib.

Looking into this it's pretty clear *why* it broke: before, the 's#'
format code was used, while Gregory's change switched this to the
buffer API (to ensure the data won't move around). Now, when a
(Unicode) string is passed to s#, it uses the UTF-8 encoding. But the
buffer API uses the raw bytes in the Unicode object, which is
typically UTF-16 or UTF-32. (I can't quite figure out why the tests
didn't fail on my Linux box; I'm guessing it's an endianness issue,
but it can't be that simple. Perhaps that box happens to be falling
back on a different implementation of the checksums?)
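
Here's a quick illustration, simulating both behaviors with explicit
encode() calls (the raw internal bytes aren't reachable from pure
Python, so UTF-16-LE stands in for the internal representation here):

    import hashlib

    s = "p\u00e4ss"  # any string will do; non-ASCII makes the point vivid
    # What 's#' did: hash the UTF-8 encoding.
    print(hashlib.md5(s.encode("utf-8")).hexdigest())
    # What the raw buffer gave: the internal bytes, e.g. UTF-16-LE.
    print(hashlib.md5(s.encode("utf-16-le")).hexdigest())

The two digests differ, which is why the computed values stopped
matching the expected ones baked into the tests.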

I checked in a fix (because I don't like broken tests :-) which
restores the old behavior by passing PyBUF_CHARACTER to
PyObject_GetBuffer(), which enables a special case in the buffer API
for PyUnicode that returns the UTF-8 encoded bytes instead of the raw
bytes. (I still find this questionable, especially since a few random
places in bytesobject.c also use PyBUF_CHARACTER, presumably to make
tests pass; but for the *bytes* type, requesting *characters* (even
encoded ones) is iffy.)

But I'm wondering if passing a Unicode string to the various hash
digest functions should work at all! Hashes are defined on sequences
of bytes, and IMO we should insist that the user pass us bytes, and
not second-guess what to do with Unicode.
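
Under that rule the caller would have to pick an encoding explicitly;
a sketch of the intended usage, not current behavior:

    import hashlib

    s = "\u20ac100"
    # The user decides how the text becomes bytes; the hash never guesses.
    digest = hashlib.sha1(s.encode("utf-8")).hexdigest()
    # Passing the string itself would simply raise a TypeError.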

Opinions?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

