Databases and Python

Jonathan Gardner jgardner at jonathangardner.net
Thu Feb 16 02:37:31 EST 2006


I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like your algorithm and data
representation could use some optimization.

I would do pretty much what you are doing, except I would have only the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID
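
For concreteness, here's a minimal sketch of those three relations,
assuming SQLite through Python's sqlite3 module (the table and column
names are made up; BDB or PostgreSQL would be analogous):

    import sqlite3

    conn = sqlite3.connect('index.db')
    cur = conn.cursor()

    # word -> word ID
    cur.execute("""CREATE TABLE words (
                       word_id INTEGER PRIMARY KEY,
                       word    TEXT)""")

    # filename -> filename ID
    cur.execute("""CREATE TABLE files (
                       file_id  INTEGER PRIMARY KEY,
                       filename TEXT)""")

    # word ID -> filename ID
    cur.execute("""CREATE TABLE word_file (
                       word_id INTEGER,
                       file_id INTEGER)""")

    conn.commit()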

You're going to want an index on pretty much every column in this
database, because you're going to look up by any one of these columns
to get the corresponding value.
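
In the sqlite3 sketch above, that full set of lookups would eventually
be covered by something like this (the INTEGER PRIMARY KEY columns
already act as indexes):

    # look up a word's ID, a filename's ID, and either direction
    # of the word/file relation
    cur.execute("CREATE INDEX idx_words_word ON words (word)")
    cur.execute("CREATE INDEX idx_files_name ON files (filename)")
    cur.execute("CREATE INDEX idx_wf_word    ON word_file (word_id)")
    cur.execute("CREATE INDEX idx_wf_file    ON word_file (file_id)")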

I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate
the data. Then you build the indexes only as you need them. Let's say
you are scanning your files. You won't need an index on the
filename-to-ID table. That's because you are just putting data in
there. The word-to-ID table needs an index on the word, but not on the
ID (you're not looking up by ID yet). And the word ID-to-filename ID table
doesn't need any indexes yet either. So build up the data without the
indexes. Once your scan is complete, build the indexes you'll need for
regular operation. After that, you can probably keep adding data
incrementally as you go.
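
Continuing that sketch, a two-stage load might look roughly like this
(words_in() is a hypothetical tokenizer, and none of the indexes above
have been created yet; only the word lookup is indexed while scanning):

    def load(filenames, cur):
        # stage 1: accumulate data; index only what the scan itself looks up
        cur.execute("CREATE INDEX idx_words_word ON words (word)")
        for filename in filenames:
            cur.execute("INSERT INTO files (filename) VALUES (?)", (filename,))
            file_id = cur.lastrowid
            for word in words_in(filename):
                cur.execute("SELECT word_id FROM words WHERE word = ?", (word,))
                row = cur.fetchone()
                if row is None:
                    cur.execute("INSERT INTO words (word) VALUES (?)", (word,))
                    word_id = cur.lastrowid
                else:
                    word_id = row[0]
                cur.execute("INSERT INTO word_file (word_id, file_id) "
                            "VALUES (?, ?)", (word_id, file_id))

        # stage 2: the scan is done, so build the remaining lookup indexes
        cur.execute("CREATE INDEX idx_files_name ON files (filename)")
        cur.execute("CREATE INDEX idx_wf_word    ON word_file (word_id)")
        cur.execute("CREATE INDEX idx_wf_file    ON word_file (file_id)")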

As far as filename IDs and word IDs go, just use a counter to generate
the next number. Encoding the number in base 255 really isn't going to
save you much space.
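
With the sqlite3 sketch, INTEGER PRIMARY KEY already does this for you
(that's the lastrowid above). Doing it by hand, for example on top of a
BDB, is just a dictionary and a counter; the names here are made up:

    import itertools

    counter = itertools.count(1)
    word_ids = {}

    def word_id(word):
        # assign the next plain integer the first time a word is seen
        if word not in word_ids:
            word_ids[word] = counter.next()   # next(counter) on newer Pythons
        return word_ids[word]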

And your idea of hundreds of thousands of tables? Very bad. Don't do
it.



