Searching for lots of similar strings (filenames) in sqlite3 database

Tue Jul 1 11:15:47 EDT 2014

On 2014-07-01, Chris Angelico wrote:

> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk <a24061 at ducksburg.com> wrote:
>>         cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,))
>
> Shouldn't this be an equality check rather than IS, which normally I'd
> expect to be "IS NULL" or "IS NOT NULL"?

Oh, it probably should be in "heavy" SQL.  In SQLite, '==', '=', &
'IS' are interchangeable.

http://www.tutorialspoint.com/sqlite/sqlite_operators.htm

Looking at that page again, I see that 'GLOB' is a case-sensitive
version of 'LIKE'.  I can't help but wonder if that makes it faster.
;-)

> As to your actual question: Your two database lookups are doing
> distinctly different things, so there's no surprise that they perform
> very differently. B asks the database "Do you have this? Do you have
> this?" for every file you have, and C asks the database "What do you
> have?", and then comparing that against the list of files. By the way
> - the A+C technique could be done quite tidily as a set difference:
>
> # assume you have listing1 and cursor set up
> # as per your above code
>     listing = {os.path.join(directory, x) for x in listing1}
>     cursor.execute(...) # as per above
>     known_files = {row[0] for row in cursor} # cursors are iterable
>     needed_files = listing - known_files
>     cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename,
> 0) for filename in needed_files))

Oh, even better:

    add_files = listing - known_files
    delete_files = known_files - listing

and then I can remove files that have disappeared off the spool from
the table.  Thanks very much!

> Anyway. The significant thing is the performance of the database on
> two different workloads: either "give me everything that matches this
> pattern" (where the pattern ends with a percent sign), or "do you have
> this? do you have this? do you have this?". Generally, database
> indexing is fairly efficient at handling prefix searches, so the first
> query will basically amount to an index search, which is a lot faster
> than the repeated separate searching; it takes advantage of the fact
> that all the strings you're looking at will have the same prefix.
>
> There is one critical consideration, though. What happens if the
> directory name contains an underscore or percent sign? Or can you
> absolutely guarantee that they won't? You may need to escape them, and
> I'm not sure how SQLite handles that. (Possibly \_ will match literal
> _, and \\ will match literal \, or something like that.)

I can guarantee that the directory names are all
'/var/spool/news/message.id/' then 3 digits.  (The filenames are
pretty wild, since they are MIDs.)  AIUI, using the '?' substitution
in the sqlite3 library is supposed to be safe.

> This is not bypassing the database's optimization; in fact, it's
> working tidily within it. 

That's reassuring!

...
> But doing the set difference in Python is just as good a way of doing the job.

I like it.  Thanks very much.

-- 
Specifications are for the weak & timid!
          --- Klingon Programmer's Guide