Databases and Python

Dan Stromberg strombrg at dcs.nac.uci.edu
Thu Feb 16 23:12:40 EST 2006


On Wed, 15 Feb 2006 23:37:31 -0800, Jonathan Gardner wrote:

> I'm no expert in BDBs, but I have spent a fair amount of time working
> with PostgreSQL and Oracle. It sounds like you need to put some
> optimization into your algorithm and data representation.
> 
> I would do pretty much like you are doing, except I would only have the
> following relations:
> 
> - word to word ID
> - filename to filename ID
> - word ID to filename ID

I think I could ditch the "word ID to word" table, but I'll probably
eventually need the "filename ID to filename" table, so that when I pull a
list of filename IDs by word I can map them back to the filenames.
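
So the set of tables I'm picturing is roughly this - sketched here as plain
dicts, though each one is really a Berkeley DB table:

# Plain dicts standing in for the Berkeley DB tables in this sketch.

word_to_id = {}           # word -> word number (no reverse table needed)
filename_to_id = {}       # filename -> filename number
id_to_filename = {}       # filename number -> filename, so that a lookup
                          # by word can be turned back into actual filenames
word_id_to_file_ids = {}  # word number -> (packed) list of filename numbers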

> You're going to want an index on pretty much every column in this
> database. That's because you're going to lookup by any one of these
> columns for the corresponding value.

I do need some pretty heavy indexing.

In fact, so far everything seems to be clicking along pretty well, except
for the part I'm having trouble indexing - the per-word list of filenames.
And that's because I already have one relation (word number -> filename
numbers) that complicates setting up another relation for that list of
filenames (just filename ID -> '', but it'd be soooo much faster if I could
nest one relation inside the other somehow).

> I said I wasn't an expert in BDBs. But I do have some experience
> building up large databases. In the first stage, you just accumulate the
> data. Then you build the indexes only as you need them. Let's say you
> are scanning your files. You won't need an index on the filename-to-ID
> table. That's because you are just putting data in there. The word-to-ID
> table needs an index on the word, but not ID (you're not looking up by
> ID yet.) And the word ID-to-filename ID table doesn't need any indexes
> yet either. So build up the data without the indexes. Once your scan is
> complete, then build up the indexes you'll need for regular operation.
> You can probably incrementally add data as you go.

Interesting thought!  I think I need to ponder this a bit more.

I think the filename <-> filename number tables are pretty inexpensive,
and the word <-> word number tables seem pretty inexpensive too.  I think
it's primarily the word number to filename numbers table that's hurting
performance really badly, and that's probably because the list of filename
numbers is growing so large in some cases: adding a filename to a word tends
to be O(n), where n is the number of filenames already on the word.
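
Concretely, I think each insert ends up being a read-modify-write like the
sketch below (not my actual code; SEP is just a stand-in for whichever byte
separates the packed numbers, and word_table can be a dict or a dict-like
DB handle):

SEP = b'\xff'   # assumed one-byte separator between packed numbers

def append_filename(word_table, word_id, encoded_filename_id):
    # With one packed value per word, adding a filename number means reading
    # the whole existing value, appending, and writing the whole thing back.
    packed = word_table.get(word_id)
    if packed is None:
        word_table[word_id] = encoded_filename_id
    else:
        # the whole, ever-growing value is rewritten on every single insert
        word_table[word_id] = packed + SEP + encoded_filename_id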

May I ask: in what representation would you keep this word number to
filename numbers relation stashed away, so that the indexes could be built
from it more rapidly in a later pass?  Are you thinking of keeping it all in
memory, or storing it in a flat file somehow?
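
For instance, is it something along these lines you have in mind?
(pairs_file and the tab-separated format are just placeholders I'm making
up):

import itertools

def accumulate(pairs_file, word_id, filename_id):
    # During the scan, just append (word number, filename number) pairs to a
    # flat file -- no indexes, no read-modify-write.
    pairs_file.write('%d\t%d\n' % (word_id, filename_id))

def build_index(pairs_path, word_table):
    # Afterward, sort and group the pairs and write each word's complete
    # filename list in a single pass.  (For really large data, an external
    # sort such as the Unix sort command would replace sorted().)
    with open(pairs_path) as pairs:
        rows = sorted(tuple(map(int, line.split())) for line in pairs)
    for word_id, group in itertools.groupby(rows, key=lambda row: row[0]):
        # one write per word, instead of one rewrite per (word, filename) pair
        word_table[word_id] = [filename_id for _, filename_id in group]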

> As far as filename ID and word IDs go, just use a counter to generate
> the next number. If you use base255 as the number, you're really not
> going to save much space.

That's basically what I'm doing - incrementing a counter for each word,
except I'm converting that counter to base255.  The base255 representation
of that counter brings the following benefits (roughly sketched in code
after the list):

1) Converting a set of numbers to a string, without losing number
boundaries, becomes just mapping them to base255, and
string.joinfields'ing them.

2) Converting a string to a set of numbers, again without losing number
boundaries, is just string.splitfields'ing them, and mapping them from
base255 back to a list or set of numbers.

3) The numbers can be arbitrarily large, unlike with, say, a 4- or 8-byte
fixed-length record representation.
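
In sketch form it amounts to this (my real code is Python 2 using the old
string module's joinfields/splitfields; this sketch is modern Python 3 with
bytes, and 0xff here stands in for whichever byte value gets reserved as
the separator):

def to_base255(n):
    """Encode a non-negative integer as a byte string of base-255 digits."""
    if n == 0:
        return b'\x00'
    digits = []
    while n:
        digits.append(n % 255)
        n //= 255
    return bytes(reversed(digits))

def from_base255(digits):
    """Decode a byte string of base-255 digits back into an integer."""
    n = 0
    for digit in digits:
        n = n * 255 + digit
    return n

def pack_numbers(numbers):
    # Benefit 1: base-255 digits only use byte values 0..254, so 0xff can
    # safely mark the boundaries between numbers.
    return b'\xff'.join(to_base255(n) for n in numbers)

def unpack_numbers(packed):
    # Benefit 2: splitting on 0xff recovers the original numbers exactly.
    if not packed:
        return []
    return [from_base255(chunk) for chunk in packed.split(b'\xff')]

assert unpack_numbers(pack_numbers([0, 12, 70000])) == [0, 12, 70000]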

> And your idea of hundreds of thousands of tables? Very bad. Don't do it.

Sigh.  I was afraid of that. :-S

I'm thinking though...  what if I were to use the current representation
for words that don't appear in that many files, and a table for each word
that has >= 1000 (or whatever threshold) filenames associated with it...?

That should greatly cut down on the number of tables, while still giving
the more efficient representation for the words that need it.

Or not?
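
To make that concrete, the hybrid I'm picturing is something like the
sketch below.  THRESHOLD, the hot_words directory, and the lone-0xff
"promoted" marker are all just placeholders, dbm files stand in for
per-word Berkeley DB tables, and the pack/unpack helpers are the ones
sketched earlier:

import dbm
import os

THRESHOLD = 1000        # or whatever threshold makes sense
PROMOTED = b'\xff'      # a bare separator can never be a real packed list
HOT_DIR = 'hot_words'   # one small table per promoted ("hot") word
os.makedirs(HOT_DIR, exist_ok=True)

def hot_table(word_id):
    """Open (or create) the per-word table for a promoted word."""
    return dbm.open(os.path.join(HOT_DIR, word_id.hex()), 'c')

def add_filename(main_table, word_id, filename_id):
    packed = main_table.get(word_id, b'')
    if packed == PROMOTED:
        # Hot word: each filename number is its own key, so this is O(1).
        with hot_table(word_id) as table:
            table[filename_id] = b''
        return
    ids = unpack_numbers(packed)              # helpers from the earlier sketch
    ids.append(from_base255(filename_id))
    if len(ids) < THRESHOLD:
        # Still a small word: keep the cheap packed-string representation.
        main_table[word_id] = pack_numbers(ids)
    else:
        # Crossed the threshold: move the whole list into its own table.
        with hot_table(word_id) as table:
            for n in ids:
                table[to_base255(n)] = b''
        main_table[word_id] = PROMOTED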
