hashing strings to integers for sqlite3 keys

alister alister.nospam.ware at ntlworld.com
Thu May 22 10:48:19 EDT 2014


On Thu, 22 May 2014 12:47:31 +0100, Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library. I'm
> processing a lot of strings from input files (among other things, values
> of headers in e-mail & news messages) and suppressing duplicates using a
> table of seen strings in the database.
> 
> It seems to me --- from past experience with other things, where testing
> integers for equality is faster than testing strings, as well as from
> reading the SQLite3 documentation about INTEGER PRIMARY KEY --- that the
> SELECT tests should be faster if I am looking up an INTEGER PRIMARY KEY
> value rather than TEXT PRIMARY KEY.  Is that right?
> 
> If so, what sort of hashing function should I use?  The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes.  The only
> thing I've thought of so far is to use MD5 or SHA-something modulo the
> maxint value.  (Security isn't an issue --- i.e., I'm not worried about
> someone trying to create a hash collision.)
> 
> Thanks,
> Adam

why not just set the field in the DB to be unique & then catch the error 
when you try to write a duplicate?

let the DB engine handle the task
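
a rough sketch of what I mean, using the standard sqlite3 module. The table
name `seen`, the column name `value` and the helper `add_if_new` are made up
for illustration; adapt them to your own schema. It assumes a UNIQUE
constraint on the text column and relies on sqlite3.IntegrityError being
raised when the constraint is violated.

```python
import sqlite3

# In-memory database for the example; a real script would pass a file path.
conn = sqlite3.connect(":memory:")

# Hypothetical table: the UNIQUE constraint makes SQLite reject duplicates.
conn.execute("CREATE TABLE seen (value TEXT NOT NULL UNIQUE)")


def add_if_new(conn, value):
    """Insert value; return True if it was new, False if already present."""
    try:
        # The connection as a context manager commits on success
        # and rolls back if an exception is raised.
        with conn:
            conn.execute("INSERT INTO seen (value) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        # Raised when the UNIQUE constraint is violated, i.e. a duplicate.
        return False


headers = ["a@example.com", "b@example.com", "a@example.com"]
results = [add_if_new(conn, h) for h in headers]
row_count = conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
print(results, row_count)
```

if you do not need to know whether a given string was a duplicate, SQLite's
INSERT OR IGNORE would skip duplicates without raising anything.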


-- 
Your step will soil many countries.


