Databases and python

Bryan Olson fakeaddress at nowhere.org
Thu Feb 16 08:45:28 EST 2006


Dan Stromberg wrote:
> I've been putting a little bit of time into a file indexing engine
[...]

> So far, I've been taking the approach of using a single-table database
> like gdbm or dbhash [...] and making each entry keyed by
> a word, and under the word in the database is a null terminated list of
> filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
> 
[...]
> the program just isn't performing like I'd like it to.
 >
> And I think it's because despite the caching and minimal representation
> conversion, it's still just too slow converting linear lists to arrays
> back and forth and back and forth.

I think I follow. The straightforward method of building the
list of files associated with a word is order n**2 time. On each
new entry, you look up the entire string, append one file-id to
it, and write the new version back.

> So this leads me to wonder - is there a python database interface that
> would allow me to define a -lot- of tables?  Like, each word becomes a
> table, and then the fields in that table are just the filenames that
> contained that word.  That way adding filenames to a word shouldn't bog
> down much at all.

Well, you could use simple files instead of fancy database tables.

Below is a demo of an alternate technique that uses bsddb B-Trees,
and puts both the word and the file-id in the key. I don't know
how efficient it is for real data, but at least the time won't grow
as Theta(n**2).

--Bryan



import bsddb
import urllib


def add_words_from_file(index, fname, word_iterator):
     """ Pass the open-for-write bsddb B-Tree, a filename, and a list
         (or any interable) of the words in the file.
     """
     fname = urllib.quote_plus(fname)
     s = set()
     for word in word_iterator:
         if word not in s:
             s.update(word)
             key = '%s %s' % (urllib.quote_plus(word), fname)
             index[key] = ''
     index.sync()


def lookup(index, word):
     """ Pass the B-Tree (as built with add_words_from_file) and a
         word to look up. Returns list of files containing the word.
     """
     word = urllib.quote_plus(word)
     fname_list = []
     try:
         current = index.set_location(word)
         while True:
             (key, _) = current
             (w, fname) = key.split()
             if w != word:
                 break
             fname_list.append(urllib.unquote_plus(fname))
             current = index.next()
     except KeyError:
         pass
     return fname_list


def test():
     index = bsddb.btopen('junktest.bdb', 'n')
     data =[
         ('bryfile.txt', 'nor heed the rumble of a distant drum'),
         ('junkfile.txt', 'this is the beast, the beast so sly'),
         ('word file.txt', 'is this the way it always is here in Baltimore')
     ]
     for (fname, text) in data:
         words = text.split()
         add_words_from_file(index, fname, words)

     for word in ['is', 'the', 'heed', 'this', 'way']:
         print '"%s" is in files: %s' % (word, lookup(index, word))

test()



More information about the Python-list mailing list