Databases and python

Fri Feb 17 02:25:35 EST 2006

About indexes everywhere: Yes, you don't have to be a DB expert to know
that indexes everywhere is bad. But look at this example. There are
really two ways that the data is going to get accessed in regular use.
Either they are going to ask for all files that have a word (most
likely) or they are going to see which words are in a file.

I'm going to have to name the tables now, aren't I? Here's a simple
schema:

words
--------
word_id
word

files
------
file_id
filename

word_files
--------------
file_id
word_id

If you are going to lookup by word, you'll need an index on words.word.
You'll also need an index on word_files.word_id. And then you'll need
an index on files.file_id.

If you are going to lookup by file, you'll need an index on
files.filename, word_files.file_id, and words.word_id.

So it ends up in this situation you need indexes everywhere.

Now, when you are doing the initial population, you should drop all the
indexes you don't need during population. That means everything but
words.word has to go. (You'll need to find the word_id for previously
seen words.) After the initial population, then is the time to build
and add the indexes. it's much faster to build an index when you have
the entire set of data in front of you than to do it piece-by-piece.
Some indexes actually get built better than they would've piecemeal.

Unfortunately this is no longer strictly topical to Python. But if you
free your mind from thinking in terms of SQL databases and look at
indexes as dicts or whatnot, then you can see that this is really a
general programming problem.