hashing strings to integers for sqlite3 keys

Thu May 22 09:41:17 EDT 2014

On 2014-05-22, Peter Otten wrote:

> Adam Funk wrote:
>
>> I'm using Python 3.3 and the sqlite3 module in the standard library.
>> I'm processing a lot of strings from input files (among other things,
>> values of headers in e-mail & news messages) and suppressing
>> duplicates using a table of seen strings in the database.
>> 
>> It seems to me --- from past experience with other things, where
>> testing integers for equality is faster than testing strings, as well
>> as from reading the SQLite3 documentation about INTEGER PRIMARY KEY
>> --- that the SELECT tests should be faster if I am looking up an
>> INTEGER PRIMARY KEY value rather than TEXT PRIMARY KEY.  Is that
>> right?
>
> My gut feeling tells me that this would matter more for join operations than 
> lookup of a value. If you plan to do joins you could use an autoinc integer 
> as the primary key and an additional string key for lookup.

I'm not doing any join operations.  I'm using sqlite3 for storing big
piles of data & persistence between runs --- not really "proper
relational database use".  In this particular case, I'm getting header
values out of messages & doing this:

  for this_string in these_strings:
    if not already_seen(this_string):
      process(this_string)
    # ignore if already seen
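
For concreteness, here is a minimal sketch of what already_seen might
look like with the plain TEXT PRIMARY KEY approach.  The table name
"seen", the column name "s", the in-memory database and the sample
strings are all assumptions made for illustration; none of them come
from the real code.

```python
import sqlite3

# Assumption: an in-memory database; a file path here would give
# persistence between runs.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: a single TEXT PRIMARY KEY column of seen strings.
conn.execute("CREATE TABLE IF NOT EXISTS seen (s TEXT PRIMARY KEY)")


def already_seen(this_string):
    """Return True if this_string was recorded earlier.

    Otherwise record it and return False.
    """
    row = conn.execute(
        "SELECT 1 FROM seen WHERE s = ?", (this_string,)
    ).fetchone()
    if row is not None:
        return True
    conn.execute("INSERT INTO seen (s) VALUES (?)", (this_string,))
    return False


# Usage mirroring the loop above, with made-up header values.
processed = []
for this_string in ["a@example.com", "b@example.com", "a@example.com"]:
    if not already_seen(this_string):
        processed.append(this_string)  # stands in for process(this_string)
conn.commit()
```

The lookup goes through the index that SQLite creates for the primary
key, so the open question is only whether an integer key would make
that lookup measurably faster.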

...
> and only if you can demonstrate a significant speedup keep the complication 
> in your code.
>
> If you find such a speedup I'd like to see the numbers because this cries 
> PREMATURE OPTIMIZATION...

On further reflection, I think I asked for that.  In fact, the table
I'm using has only one column, for the hashes --- to save disk space,
I wasn't going to store the strings at all (maybe my mind is stuck in
the 1980s).
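
A rough sketch of the hash-only variant, under stated assumptions:
SHA-256 truncated to 8 bytes is just one possible choice of hash, and
the names string_key, seen_hashes and h are hypothetical.  The
built-in hash() is not used because string hashes are randomized per
process by default in Python 3.3, so its values would not survive
between runs.  SQLite stores INTEGER values as signed 64-bit numbers,
hence signed=True.

```python
import hashlib
import sqlite3


def string_key(s):
    """Map a string to a signed 64-bit integer that is stable across runs."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    # Keep the first 8 bytes so the value fits in an SQLite INTEGER.
    return int.from_bytes(digest[:8], "big", signed=True)


# Assumption: an in-memory database; a file path would give persistence.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: only the hashes are stored, not the strings.
conn.execute("CREATE TABLE IF NOT EXISTS seen_hashes (h INTEGER PRIMARY KEY)")


def already_seen(this_string):
    """Return True if the hash of this_string was recorded earlier.

    Otherwise record the hash and return False.
    """
    key = string_key(this_string)
    row = conn.execute(
        "SELECT 1 FROM seen_hashes WHERE h = ?", (key,)
    ).fetchone()
    if row is not None:
        return True
    conn.execute("INSERT INTO seen_hashes (h) VALUES (?)", (key,))
    return False
```

The trade-off is that two different strings can collide on the same
64-bit value, in which case the second one would be wrongly skipped as
already seen, and with no strings stored there is no way to detect
that afterwards.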

-- 
But the government always tries to coax well-known writers into the
Establishment; it makes them feel educated.         [Robert Graves]