Databases and Python

Dan Stromberg strombrg at dcs.nac.uci.edu
Thu Feb 16 22:14:00 EST 2006


On Thu, 16 Feb 2006 13:45:28 +0000, Bryan Olson wrote:

> Dan Stromberg wrote:
>> I've been putting a little bit of time into a file indexing engine
> [...]
> 
>> So far, I've been taking the approach of using a single-table database
>> like gdbm or dbhash [...] and making each entry keyed by
>> a word, and under the word in the database is a null terminated list of
>> filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
>> 
> [...]
>> the program just isn't performing like I'd like it to.
>>
>> And I think it's because despite the caching and minimal representation
>> conversion, it's still just too slow converting linear lists to arrays
>> back and forth and back and forth.
> 
> I think I follow. The straightforward method of building the
> list of files associated with a word is order n**2 time. On each
> new entry, you look up the entire string, append one file-id to
> it, and write the new version back.

Yes, essentially, with the twist that the most recently used words are
kept cached in memory, so they don't have to be converted from string to
list and back to string every time a filename is added to a word.
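
Schematically, the current scheme is something like this (a rough sketch
with invented names; the real code also evicts cache entries and flushes
them back on a schedule):

import gdbm

db = gdbm.open('index.gdbm', 'c')
cache = {}    # word -> list of filenames, for recently used words

def add(word, fname):
    # Parse the stored record into a list only on a cache miss;
    # after that, appends are cheap until the entry is written back.
    if word not in cache:
        try:
            cache[word] = db[word].split('\0')
        except KeyError:
            cache[word] = []
    cache[word].append(fname)

def flush():
    # The expensive part: each list is converted back into a single
    # string and the whole record is rewritten.
    for word, files in cache.items():
        db[word] = '\0'.join(files)
    db.sync()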

>> So this leads me to wonder - is there a python database interface that
>> would allow me to define a -lot- of tables?  Like, each word becomes a
>> table, and then the fields in that table are just the filenames that
>> contained that word.  That way adding filenames to a word shouldn't bog
>> down much at all.
> 
> Well, you could use simple files instead of fancy database tables.

That's an interesting thought.  Perhaps especially if a word like
australopithecine were saved under a path like:

~/indices/au/st/ra/lo/pi/th/ec/in/e
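
Something like this sketch, say (the two-character chunking and the
~/indices root are just one choice):

import os

INDEX_ROOT = os.path.expanduser('~/indices')

def word_to_path(word):
    # Two characters per directory level, so no single directory
    # gets huge: 'australopithecine' -> au/st/ra/lo/pi/th/ec/in/e
    parts = [word[i:i+2] for i in range(0, len(word), 2)]
    return os.path.join(INDEX_ROOT, *parts)

def add_filename(word, fname):
    # Appending a filename is then O(1): no read-modify-write cycle.
    path = word_to_path(word)
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    f = open(path, 'a')
    f.write(fname + '\0')    # same null-terminated list as before
    f.close()

Looking a word up would then just be reading one small file.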

> Below is a demo of an alternate technique that uses bsddb B-Trees,
> and puts both the word and the file-id in the key. I don't know
> how efficient it is for real data, but at least the time won't grow
> as Theta(n**2).

Perhaps I'm missing something, but is it not roughly O(1) for
individual insertions, but O(n*m) (n == number of files, m == number of
words) for lookups?

> --Bryan
> 
> 
> 
> import bsddb
> import urllib
> 
> 
> def add_words_from_file(index, fname, word_iterator):
>      """ Pass the open-for-write bsddb B-Tree, a filename, and a list
>          (or any iterable) of the words in the file.
>      """
>      fname = urllib.quote_plus(fname)
>      s = set()
>      for word in word_iterator:
>          if word not in s:
>              s.add(word)    # add(), not update(): update() would insert each character
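>              # the key packs word and filename together; quote_plus
>              # escaping ensures ' ' appears only as the separator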
>              key = '%s %s' % (urllib.quote_plus(word), fname)
>              index[key] = ''
>      index.sync()
> 
> 
> def lookup(index, word):
>      """ Pass the B-Tree (as built with add_words_from_file) and a
>          word to look up. Returns list of files containing the word.
>      """
>      word = urllib.quote_plus(word)
>      fname_list = []
>      try:
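>          # set_location() positions the cursor at the smallest key
>          # >= word; scan forward until the word part of the key changes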
>          current = index.set_location(word)
>          while True:
>              (key, _) = current
>              (w, fname) = key.split()
>              if w != word:
>                  break
>              fname_list.append(urllib.unquote_plus(fname))
>              current = index.next()
>      except KeyError:
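>          # set_location() and next() raise KeyError past the last key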
>          pass
>      return fname_list
> 
> 
> def test():
>      index = bsddb.btopen('junktest.bdb', 'n')
>      data =[
>          ('bryfile.txt', 'nor heed the rumble of a distant drum'),
>          ('junkfile.txt', 'this is the beast, the beast so sly'),
>          ('word file.txt', 'is this the way it always is here in Baltimore')
>      ]
>      for (fname, text) in data:
>          words = text.split()
>          add_words_from_file(index, fname, words)
> 
>      for word in ['is', 'the', 'heed', 'this', 'way']:
>          print '"%s" is in files: %s' % (word, lookup(index, word))
> 
> test()
