hashing strings to integers (was: hashing strings to integers for sqlite3 keys)

Adam Funk a24061 at ducksburg.com
Fri May 23 06:27:08 EDT 2014


On 2014-05-22, Peter Otten wrote:

> Adam Funk wrote:

>> Well, J*v* returns a byte array, so I used to do this:
>> 
>>     digester = MessageDigest.getInstance("MD5");
>>     ...
>>     digester.reset();
>>     byte[] digest = digester.digest(bytes);
>>     return new BigInteger(+1, digest);
>
> In Python 3 there's int.from_bytes()
>
>>>> h = hashlib.sha1(b"Hello world")
>>>> int.from_bytes(h.digest(), "little")
> 538059071683667711846616050503420899184350089339

Excellent, thanks for pointing that out.  I've just recently started
using Python 3 instead of 2, & appreciate pointers to new things like
that.  The only thing that really bugs me in Python 3 is that execfile
has been removed (I find it useful for testing things interactively).


>> I dunno why language designers don't make it easy to get a single big
>> number directly out of these things.
>  
> You hardly ever need to manipulate the numerical value of the digest. And on 
> its way into the database it will be re-serialized anyway.

I now agree that my original plan to hash strings for the SQLite3
table was pointless, so I've changed the subject header.  :-)

I have had good reason to use int hashes in the past, however.  I was
doing some experiments with Andrei Broder's "sketches of shingles"
technique for finding partial duplication between documents, & you
need integers for that so you can do modulo arithmetic.

I've also used hashes of strings for other things involving
deduplication or fast lookups (because integer equality is faster than
string equality).  I guess if it's just for deduplication, though, a
set of byte arrays is as good as a set of int?


-- 
Classical Greek lent itself to the promulgation of a rich culture,
indeed, to Western civilization.  Computer languages bring us
doorbells that chime with thirty-two tunes, alt.sex.bestiality, and
Tetris clones.                                         (Stoll 1995)



More information about the Python-list mailing list