hashing strings to integers for sqlite3 keys

Peter Otten __peter__ at web.de
Thu May 22 08:58:50 EDT 2014


Adam Funk wrote:

> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other things,
> values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
> 
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as well
> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
> --- that the SELECT tests should be faster if I am looking up an
> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
> right?

My gut feeling tells me that this would matter more for join operations than 
lookup of a value. If you plan to do joins you could use an autoinc integer 
as the primary key and an additional string key for lookup.
 
> If so, what sort of hashing function should I use?  The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes.  The only
> thing I've thought of so far is to use MD5 or SHA-something modulo the
> maxint value.  (Security isn't an issue --- i.e., I'm not worried
> about someone trying to create a hash collision.)

Start with the cheapest operation you can think of, 

md5(s) % MAXINT

or even

hash(s) % MAXINT # don't forget to set PYTHONHASHSEED

then compare performance with just

s

and only if you can demonstrate a significant speedup keep the complication 
in your code.

If you find such a speedup I'd like to see the numbers because this cries 
PREMATURE OPTIMIZATION...




More information about the Python-list mailing list