hashing strings to integers for sqlite3 keys
Peter Otten
__peter__ at web.de
Thu May 22 08:58:50 EDT 2014
Adam Funk wrote:
> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other things,
> values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
>
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as well
> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
> --- that the SELECT tests should be faster if I am looking up an
> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY. Is that
> right?
My gut feeling tells me that this would matter more for join operations than
lookup of a value. If you plan to do joins you could use an autoinc integer
as the primary key and an additional string key for lookup.
> If so, what sort of hashing function should I use? The "maxint" for
> SQLite3 is a lot smaller than the size of even MD5 hashes. The only
> thing I've thought of so far is to use MD5 or SHA-something modulo the
> maxint value. (Security isn't an issue --- i.e., I'm not worried
> about someone trying to create a hash collision.)
Start with the cheapest operation you can think of,
md5(s) % MAXINT
or even
hash(s) % MAXINT # don't forget to set PYTHONHASHSEED
then compare performance with just
s
and only if you can demonstrate a significant speedup keep the complication
in your code.
If you find such a speedup I'd like to see the numbers because this cries
PREMATURE OPTIMIZATION...
More information about the Python-list
mailing list