Searching for lots of similar strings (filenames) in sqlite3 database

Adam Funk a24061 at ducksburg.com
Wed Jul 2 05:32:55 EDT 2014


On 2014-07-01, Chris Angelico wrote:

> On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk <a24061 at ducksburg.com> wrote:
>> On 2014-07-01, Chris Angelico wrote:

>>> There is one critical consideration, though. What happens if the
>>> directory name contains an underscore or percent sign? Or can you
>>> absolutely guarantee that they won't? You may need to escape them, and
>>> I'm not sure how SQLite handles that. (Possibly \_ will match literal
>>> _, and \\ will match literal \, or something like that.)
>>
>> I can guarantee that the directory names are all
>> '/var/spool/news/message.id/' then 3 digits.  (The filenames are
>> pretty wild, since they are MIDs.)  AIUI, using the '?' substitution
>> in the sqlite3 library is supposed to be safe.
>
> This is nothing to do with question-mark substitution. There are two
> separate levels of character significance here - it's like a quoted
> string with a regex. Suppose you want to make a regex that searches
> for an apostrophe. If you try to define that in a single-quoted
> string, you need to escape it:
>
> regex = '^\'$'
>
> However, if you ask the user to enter a regex, that wouldn't be necessary:
>
> regex = input("Enter a pattern: ") # raw_input in Python 2
> Enter a pattern: ^'$
>
> This is what the question mark substitution is like - it avoids the
> need to carefully manage string delimiters and so on. However, if you
> want to make a regex that searches for a backslash, then you need to
> escape it, because the backslash is important to the regex itself. In
> the same way, the underscore and percent sign are significant to the
> LIKE operator. If it were possible to have a directory name with a
> percent sign in it, it would match far too much - because you'd
> construct a LIKE pattern something like (ahem)
> "/var/spool/news/message%20id/142/%" - and as you can see, the percent
> sign at the end is no different from the percent sign in the middle.
>
> But you're safe because you know your data, unrelated to your
> substitution method. Possibly merits a comment... but possibly not
> worth it.

Well, I've changed it to the following anyway.

        subdir_glob = subdir + '/*'
        cursor.execute('SELECT filename FROM files WHERE filename GLOB ?',
                       (subdir_glob,))
        rows = cursor.fetchall()
        known_files = {row[0] for row in rows}

I see what you mean about paths containing '%', but I don't see why
you were concerned about underscores, though.


-- 
You know, there are many people in the country today who, through no
fault of their own, are sane. Some of them were born sane. Some of
them became sane later in their lives.            --― Graham Chapman



More information about the Python-list mailing list