[Python-3000] How should the hash digest of a Unicode string be computed?

Mon Aug 27 03:13:33 CEST 2007

Gregory P. Smith wrote:
> I'm in favor of not allowing unicode for hash functions.  Depending on
> the system default encoding for a hash will not be portable.
> 
> Another question for hashlib:  It uses PyArg_Parse to get a single 's'
> out of an optional parameter [see the code] and I couldn't figure out
> what the best thing to do there was.  It just needs a C string to pass
> to OpenSSL to look up a hash function by name.  It's C so I doubt it'll
> ever be anything but ASCII.  How should that parameter be parsed
> instead of the old 's' string format?  PyBUF_CHARACTER actually sounds
> ideal in that case assuming it guarantees UTF-8, but I wasn't clear
> that it did that (is it always UTF-8, or is it the system "default
> encoding", which is possibly useless to APIs expecting C strings)?
> Requiring a bytes object would also work but I really don't like the
> idea of users needing to use a specific type for something so simple.
> (I consider string constants with their preceding b, r, u, s, type
> characters ugly in code without a good reason for them to be there)
>

The PyBUF_CHARACTER flag was an add-on after I realized that the old 
buffer API was being used in several places to get Unicode objects to 
encode their data as a string (in the default encoding of the system, I 
believe).

The unicode object is the only one that I know of that actually does 
something different when it is called with PyBUF_CHARACTER.

> test_hashlib.py passed on the x86 OS X system I was using to write the
> code.  I neglected to run the full suite or grep for hashlib in other
> test suites and run those, so I missed the test_unicodedata failure.
> Sorry about the breakage.
> 
> Is it just me, or do unicode objects supporting the buffer API seem
> like an odd concept, given that buffer API consumers (rather than
> unicode consumers) shouldn't need to know about encodings of the data
> being received?

I think you have a point.   The buffer API does support the concept of 
"formats" but not "encodings" so having this PyBUF_CHARACTER flag looks 
rather like a hack.   I'd have to look, because I don't even remember 
what is returned as the "format" from a unicode object if it is 
requested (it is probably not correct).

I would prefer that the notion of encoding a unicode object is separated 
from the notion of the buffer API, but last week I couldn't see another 
way to un-tease it.
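
A minimal sketch of what that separation could look like from the Python 
side, assuming a later CPython 3 release in which str does not export a 
buffer at all (an assumption about how things turned out, not something 
settled in this thread).  The caller picks the encoding explicitly, and 
the buffer only ever reports a format:

```python
import hashlib

text = "caf\u00e9"

# Assumption: a released CPython 3.x, where str does not support the
# buffer protocol, so asking for a buffer raises TypeError.
try:
    memoryview(text)
except TypeError:
    str_has_buffer = False
else:
    str_has_buffer = True

# Encoding is a separate, explicit step chosen by the caller.
encoded = text.encode("utf-8")

# The resulting buffer describes a format (unsigned bytes), not an encoding.
view = memoryview(encoded)
fmt = view.format

# Buffer consumers such as hashlib see only bytes.
digest = hashlib.sha256(view).hexdigest()

print(str_has_buffer, fmt, digest)
```

With this split the hash function never has to guess an encoding; it 
just consumes whatever bytes the caller produced.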

-Travis

> 
> -gps
> 
> On 8/26/07, Guido van Rossum <guido at python.org> wrote:
>> Change r57490 by Gregory P Smith broke a test in test_unicodedata and,
>> on PPC OSX, several tests in test_hashlib.
>>
>> Looking into this it's pretty clear *why* it broke: before, the 's#'
>> format code was used, while Gregory's change changed this into using
>> the buffer API (to ensure the data won't move around). Now, when a
>> (Unicode) string is passed to s#, it uses the UTF-8 encoding. But the
>> buffer API uses the raw bytes in the Unicode object, which is
>> typically UTF-16 or UTF-32. (I can't quite figure out why the tests
>> didn't fail on my Linux box; I'm guessing it's an endianness issue,
>> but it can't be that simple. Perhaps that box happens to be falling
>> back on a different implementation of the checksums?)
>>
>> I checked in a fix (because I don't like broken tests :-) which
>> restores the old behavior by passing PyBUF_CHARACTER to
>> PyObject_GetBuffer(), which enables a special case in the buffer API
>> for PyUnicode that returns the UTF-8 encoded bytes instead of the raw
>> bytes. (I still find this questionable, especially since a few random
>> places in bytesobject.c also use PyBUF_CHARACTER, presumably to make
>> tests pass, but for the *bytes* type, requesting *characters* (even
>> encoded ones) is iffy.)
>>
>> But I'm wondering if passing a Unicode string to the various hash
>> digest functions should work at all! Hashes are defined on sequences
>> of bytes, and IMO we should insist that the user pass us bytes, and
>> not second-guess what to do with Unicode.
>>
>> Opinions?
>>
>> --
>> --Guido van Rossum (home page: http://www.python.org/~guido/)
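
To illustrate the point above that a digest is defined on bytes, here is 
a small sketch showing that one string yields a different digest for each 
encoding and byte order.  The sample text and the choice of SHA-256 are 
arbitrary, and the TypeError check assumes a released Python 3 hashlib, 
which is not something this thread had decided yet:

```python
import hashlib

text = "hash me"  # arbitrary sample string

# Same characters, five different byte sequences.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
digests = {
    enc: hashlib.sha256(text.encode(enc)).hexdigest() for enc in encodings
}

for enc, digest in digests.items():
    print(f"{enc:10} {digest}")

# Assumption: a released Python 3 hashlib, which refuses str outright
# instead of picking an encoding on the caller's behalf.
try:
    hashlib.sha256(text)
except TypeError:
    rejected = True
else:
    rejected = False

print("str rejected:", rejected)
```

The little-endian and big-endian variants differ as well, which is 
consistent with raw UTF-16 or UTF-32 buffers hashing differently across 
platforms.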