unicode and hashlib

Sat Nov 29 09:23:35 EST 2008

On Nov 28, 1:24 pm, Scott David Daniels <Scott.Dani... at Acm.Org> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> >   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
>
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers) ...
>
> Unicode is characters, not a character encoding.
> You could hash on a utf-8 encoding of the Unicode.
>
> > So what is the canonical way to hash unicode?
> >  * convert unicode to the local encoding
> >  * hash in the current locale
> > ???
>
> There is no _the_ way to hash Unicode, any more than
> there is _the_ way to hash vectors.  You need to
> convert the abstract entity to something concrete with
> a well-defined representation in bytes, and hash that.
>
> > Is this just a problem for md5 hashes that I would not encounter using
> > a different method?  i.e. Should I just use the built-in hash function?
>
> No, it is a definitional problem.  Perhaps you could explain how you
> want to use the hash.  If the internal hash is acceptable (e.g. for
> grouping in dictionaries within a single run), use that.  If you intend
> to store and compare on the same system, say that.  If you want cross-
> platform execution of your code to produce the same hashes, say that.
> A hash is a means to an end, and it is hard to give advice without
> knowing the goal.
>
I am checking for changes to large text objects stored in a database
against outside sources. So the hash needs to be reproducible/stable.
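
For a reproducible hash, the suggestion above (encode to a fixed byte
representation, then hash the bytes) can be sketched roughly as follows.
This is only an illustration: it assumes UTF-8 is the agreed encoding on
both sides, and the helper name stable_text_digest and the sample strings
are made up here, not taken from any real code.

```python
import hashlib


def stable_text_digest(text, encoding="utf-8"):
    """Return the hex MD5 digest of a unicode string.

    The text is first encoded to bytes with a fixed, explicit encoding
    (UTF-8 is an assumption; any encoding works if both sides agree).
    """
    data = text.encode(encoding)  # unicode -> bytes, no implicit ascii step
    return hashlib.md5(data).hexdigest()


# Hypothetical sample data: a stored value and a freshly fetched copy.
stored = u"price \xa6 quantity"
incoming = u"price \xa6 quantity"

# Same characters and same encoding give the same digest on any platform.
changed = stable_text_digest(stored) != stable_text_digest(incoming)
```

The digest depends on the encoding as much as on the text, so the same
encoding has to be used when the hash is stored and when it is compared.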

> --Scott David Daniels
> Scott.Dani... at Acm.Org