Searching for lots of similar strings (filenames) in sqlite3 database

Chris Angelico rosuav at gmail.com
Tue Jul 1 08:06:02 EDT 2014


On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk <a24061 at ducksburg.com> wrote:
>         cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,))

Shouldn't this be an equality check rather than IS, which normally I'd
expect to be "IS NULL" or "IS NOT NULL"?

As to your actual question: Your two database lookups are doing
distinctly different things, so there's no surprise that they perform
very differently. B asks the database "Do you have this? Do you have
this?" for every file you have, and C asks the database "What do you
have?", and then comparing that against the list of files. By the way
- the A+C technique could be done quite tidily as a set difference:

# assume you have listing1 and cursor set up
# as per your above code
    listing = {os.path.join(directory, x) for x in listing1}
    cursor.execute(...) # as per above
    known_files = {row[0] for row in cursor} # cursors are iterable
    needed_files = listing - known_files
    cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename,
0) for filename in needed_files))

Anyway. The significant thing is the performance of the database on
two different workloads: either "give me everything that matches this
pattern" (where the pattern ends with a percent sign), or "do you have
this? do you have this? do you have this?". Generally, database
indexing is fairly efficient at handling prefix searches, so the first
query will basically amount to an index search, which is a lot faster
than the repeated separate searching; it takes advantage of the fact
that all the strings you're looking at will have the same prefix.

There is one critical consideration, though. What happens if the
directory name contains an underscore or percent sign? Or can you
absolutely guarantee that they won't? You may need to escape them, and
I'm not sure how SQLite handles that. (Possibly \_ will match literal
_, and \\ will match literal \, or something like that.)

This is not bypassing the database's optimization; in fact, it's
working tidily within it. If you want to express your logic in a way
that lets the database truly be in command of optimization, it would
look something like this:

INSERT INTO files SELECT filename,0 FROM
(VALUES('foo'),('bar'),('quux'),('spam')) AS needed(filename) EXCEPT
SELECT filename,0 FROM files

And you'd need to check the SQLite docs for how to define a table that
consists of literal values. (The above syntax works in PostgreSQL; I'm
pretty sure it's not all standard SQL.)

But doing the set difference in Python is just as good a way of doing the job.

ChrisA



More information about the Python-list mailing list