unicode and hashlib

Scott David Daniels Scott.Daniels at Acm.Org
Fri Nov 28 14:24:14 EST 2008


Jeff H wrote:
> hashlib.md5 does not appear to like unicode,
>   UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> position 1650: ordinal not in range(128)
> 
> After googling, I've found BDFL and others on Py3K talking about the
> problems of hashing non-bytes (i.e. buffers) ...
Unicode is characters, not a character encoding; md5 works on bytes,
so the implicit conversion to ASCII is what fails here.
You could hash a UTF-8 encoding of the Unicode string instead.
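
A minimal sketch of that suggestion, assuming Python 3 (where str is
Unicode text and hashlib accepts only bytes).  The helper name
md5_of_text is made up for illustration; it is not part of hashlib.

```python
import hashlib

def md5_of_text(text, encoding="utf-8"):
    """Return the hex MD5 digest of text, encoded explicitly first.

    Hypothetical helper for illustration only.
    """
    # Encode the characters to bytes, then hash the bytes.
    return hashlib.md5(text.encode(encoding)).hexdigest()

# u'\xa6' (BROKEN BAR) is the character from the original traceback.
digest = md5_of_text("broken bar: \xa6")
print(digest)
```

The encoding is spelled out rather than left to a default, so the
same text yields the same digest wherever the code runs.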

> So what is the canonical way to hash unicode?
>  * convert unicode to local
>  * hash in current local
> ???
There is no _the_ way to hash Unicode, any more than
there is a _the_ way to hash vectors.  You need to
convert the abstract entity to something concrete with
a well-defined representation in bytes, and hash that.
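
To illustrate why the byte representation has to be pinned down, here
is a small sketch (again assuming Python 3) that hashes one string
under three encodings.  The sample text and the choice of encodings
are arbitrary.

```python
import hashlib

text = "caf\xe9"  # four characters; the last one is not ASCII

# Same characters, three different byte representations.
digests = {
    enc: hashlib.md5(text.encode(enc)).hexdigest()
    for enc in ("utf-8", "utf-16-le", "latin-1")
}

for enc, digest in digests.items():
    print(enc, digest)
```

Each encoding produces different bytes and therefore a different
digest, so the encoding is part of the definition of the hash.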

> Is this just a problem for md5 hashes that I would not encounter using
> a different method?  i.e. Should I just use the built-in hash function?
No, it is a definitional problem.  Perhaps you could explain how you
want to use the hash.  If the internal hash is acceptable (e.g. for
grouping in dictionaries within a single run), use that.  If you intend
to store and compare on the same system, say that.  If you want cross-
platform execution of your code to produce the same hashes, say that.
A hash is a means to an end, and it is hard to give advice without
knowing the goal.
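
As a rough sketch of the difference between those goals, assuming
Python 3: the built-in hash() is fine for grouping within one run,
but for str it is randomized per process by default, so it should not
be stored or compared across runs.  A digest of explicitly encoded
bytes is stable.  The name stable_digest is made up for illustration.

```python
import hashlib

def stable_digest(text):
    """Digest that depends only on the characters (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

words = ["spam", "eggs", "spam"]

# Within a single run: equal strings hash equally, so dicts can group them.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Across runs or machines: store the digest, not hash(word).
stored = {word: stable_digest(word) for word in counts}
print(counts, stored)
```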

--Scott David Daniels
Scott.Daniels at Acm.Org


