[Python-3000] How should the hash digest of a Unicode string be computed?

Guido van Rossum guido at python.org
Mon Aug 27 03:54:49 CEST 2007


On 8/26/07, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> Gregory P. Smith wrote:
> > I'm in favor of not allowing unicode for hash functions.  Depending on
> > the system default encoding for a hash will not be portable.
> >
> > another question for hashlib:  It uses PyArg_Parse to get a single 's'
> > out of an optional parameter [see the code] and I couldn't figure out
> > what the best thing to do there was.  It just needs a C string to pass
> > to openssl to look up a hash function by name.  It's C, so I doubt it'll
> > ever be anything but ASCII.  How should that parameter be parsed
> > instead of the old 's' string format?  PyBUF_CHARACTER actually sounds
> > ideal in that case, assuming it guarantees UTF-8, but I wasn't clear
> > that it did (is it always UTF-8, or the system "default encoding",
> > which is possibly useless as far as APIs expecting C strings are
> > concerned?).  Requiring a bytes object would also work, but I really
> > don't like the idea of users needing to use a specific type for
> > something so simple.  (I consider string constants with their
> > preceding b, r, u, s type characters ugly in code without a good
> > reason for them to be there.)
> >
>
> The PyBUF_CHARACTER flag was an add-on, after I realized that the old
> buffer API was being used in several places to get Unicode objects to
> encode their data as a string (in the default encoding of the system, I
> believe).
>
> The unicode object is the only one that I know of that actually does
> something different when it is called with PyBUF_CHARACTER.

Aha, I figured something like that.

> > test_hashlib.py passed on the x86 OS X system I was using to write the
> > code.  I neglected to run the full suite or grep for hashlib in other
> > test suites and run those, so I missed the test_unicodedata failure;
> > sorry about the breakage.
> >
> > Is it just me, or does a unicode object supporting the buffer API seem
> > like an odd concept?  Buffer API consumers (as opposed to unicode
> > consumers) shouldn't need to know about the encoding of the data they
> > receive.
>
> I think you have a point.   The buffer API does support the concept of
> "formats" but not "encodings" so having this PyBUF_CHARACTER flag looks
> rather like a hack.   I'd have to look, because I don't even remember
> what is returned as the "format" from a unicode object if it is
> requested (it is probably not correct).
>
> I would prefer that the notion of encoding a unicode object is separated
> from the notion of the buffer API, but last week I couldn't see another
> way to un-tease it.

I'll work on this some more. The problem is that this behavior is
currently relied on in a number of places (some of which probably don't
even realize it), and all of those places must be changed to explicitly
encode the Unicode string instead of passing it to some API that expects
bytes.
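The per-call-site fix is small; a minimal sketch (the choice of UTF-8
below is illustrative only, since each caller has to pick its encoding
explicitly):

```python
import hashlib

text = "Python\u00e9"  # a str containing a non-ASCII character

# Instead of passing the str and letting an implicit encoding happen
# via the buffer API, the caller encodes explicitly; the digest is then
# well-defined regardless of the interpreter's internal representation.
digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
print(len(digest))  # -> 40 (hex digits of a SHA-1 digest)
```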

FWIW, this is the only issue that I have with your work so far. Two of
your friends made it to the Sprint for at least one day, but I have to
admit that I don't know whether they made any changes.

--Guido

> -Travis
>
>
>
> >
> > -gps
> >
> > On 8/26/07, Guido van Rossum <guido at python.org> wrote:
> >> Change r57490 by Gregory P Smith broke a test in test_unicodedata and,
> >> on PPC OSX, several tests in test_hashlib.
> >>
> >> Looking into this it's pretty clear *why* it broke: before, the 's#'
> >> format code was used, while Gregory's change changed this into using
> >> the buffer API (to ensure the data won't move around). Now, when a
> >> (Unicode) string is passed to s#, it uses the UTF-8 encoding. But the
> >> buffer API uses the raw bytes in the Unicode object, which is
> >> typically UTF-16 or UTF-32. (I can't quite figure out why the tests
> >> didn't fail on my Linux box; I'm guessing it's an endianness issue,
> >> but it can't be that simple. Perhaps that box happens to be falling
> >> back on a different implementation of the checksums?)
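What broke can be reproduced today with explicit encodes; a sketch in
which UTF-32 in both byte orders stands in for the platform-dependent
raw bytes of the old Unicode object:

```python
import hashlib

text = "abc"

# What 's#' fed to the hash function: the UTF-8 encoding.
utf8 = text.encode("utf-8")  # b'abc'

# What the raw buffer exposed: the internal representation, whose unit
# width and byte order vary by build, hence the platform-specific failures.
utf32_le = text.encode("utf-32-le")
utf32_be = text.encode("utf-32-be")

assert hashlib.md5(utf8).hexdigest() != hashlib.md5(utf32_le).hexdigest()
assert hashlib.md5(utf32_le).hexdigest() != hashlib.md5(utf32_be).hexdigest()
```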
> >>
> >> I checked in a fix (because I don't like broken tests :-) which
> >> restores the old behavior by passing PyBUF_CHARACTER to
> >> PyObject_GetBuffer(), which enables a special case in the buffer API
> >> for PyUnicode that returns the UTF-8 encoded bytes instead of the raw
> >> bytes. (I still find this questionable, especially since a few random
> >> places in bytesobject.c also use PyBUF_CHARACTER, presumably to make
> >> tests pass; but for the *bytes* type, requesting *characters* (even
> >> encoded ones) is iffy.)
> >>
> >> But I'm wondering if passing a Unicode string to the various hash
> >> digest functions should work at all! Hashes are defined on sequences
> >> of bytes, and IMO we should insist that the user pass us bytes, and
> >> not second-guess what to do with Unicode.
> >>
> >> Opinions?
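The underlying ambiguity is easy to demonstrate: the same text produces
different digests under different encodings, so any implicit choice
becomes a silent portability trap (a sketch):

```python
import hashlib

text = "caf\u00e9"  # 'café'

d_utf8 = hashlib.md5(text.encode("utf-8")).hexdigest()
d_latin1 = hashlib.md5(text.encode("latin-1")).hexdigest()

# The two encodings produce different byte sequences, so the digests differ.
print(d_utf8 == d_latin1)  # -> False
```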
> >>
> >> --
> >> --Guido van Rossum (home page: http://www.python.org/~guido/)
> >> _______________________________________________
> >> Python-3000 mailing list
> >> Python-3000 at python.org
> >> http://mail.python.org/mailman/listinfo/python-3000
> >> Unsubscribe: http://mail.python.org/mailman/options/python-3000/greg%40krypto.org
> >>
>
>


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

