hashing strings to integers for sqlite3 keys

Thu May 22 09:03:34 EDT 2014

On Thu, May 22, 2014 at 9:47 PM, Adam Funk <a24061 at ducksburg.com> wrote:
> I'm using Python 3.3 and the sqlite3 module in the standard library.
> I'm processing a lot of strings from input files (among other things,
> values of headers in e-mail & news messages) and suppressing
> duplicates using a table of seen strings in the database.
>
> It seems to me --- from past experience with other things, where
> testing integers for equality is faster than testing strings, as well
> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
> --- that the SELECT tests should be faster if I am looking up an
> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
> right?

It might be faster to use an integer primary key, but the possibility
of even a single collision means you can't guarantee uniqueness
without a separate check. I don't know sqlite3 well enough to say, but
based on what I know of PostgreSQL, it's usually best to make your
schema mimic your logical structure, rather than warping it for the
sake of performance. With a good indexing function, the performance of
a textual PK won't be all that much worse than an integral one, and
everything you do will read correctly in the code - no fiddling around
with hashes and collision checks.

Stick with the TEXT PRIMARY KEY and let the database do the database's
job. If you're processing a really large number of strings, you might
want to consider moving from sqlite3 to PostgreSQL anyway (I've used
psycopg2 quite happily), as you'll get better concurrency; and that
might solve your performance problem as well, as Pg plays very nicely
with caches.

ChrisA