Databases and Python

Jonathan Gardner jgardner at jonathangardner.net
Fri Feb 17 02:19:16 EST 2006


About the filename ID - word ID table: Any database that handles large
amounts of data well will do the memory management for you. If you
accumulate enough data, it may make sense to bother with PostgreSQL,
which has a pretty good record of handling very large data sets, and
large intermediate result sets as well. Again, I can't speak
authoritatively about BDB, but something in the back of my head says it
loads the entire table into memory. If so, that's not good for large
sets of data.
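If you do go the PostgreSQL route, the mapping is just a two-column
table with the right indexes. Here is a minimal sketch, assuming
psycopg2 and a reasonably recent PostgreSQL (IF NOT EXISTS and ON
CONFLICT need 9.5 or later); the database, table, and column names are
my own inventions, not anything from your setup:

    import psycopg2

    # Assumed database name; adjust to your setup.
    conn = psycopg2.connect(dbname="wordindex")
    cur = conn.cursor()

    # One row per (word, file) pair; the primary key doubles as the
    # index for the word -> files lookup.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS word_file (
            word_id  integer NOT NULL,
            file_id  integer NOT NULL,
            PRIMARY KEY (word_id, file_id)
        )
    """)
    # A second index makes the reverse lookup (file -> words) cheap too.
    cur.execute("CREATE INDEX IF NOT EXISTS word_file_file_id"
                " ON word_file (file_id)")

    # Insert a pair, then fetch every file containing word 42.
    cur.execute("INSERT INTO word_file VALUES (%s, %s)"
                " ON CONFLICT DO NOTHING", (42, 7))
    conn.commit()
    cur.execute("SELECT file_id FROM word_file WHERE word_id = %s", (42,))
    print([row[0] for row in cur.fetchall()])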

About base255: You can think of strings as numbers; at least, that's
how your computer sees them. Converting a number to base255 is really
silly: a Python string is already a series of bytes, so it already is a
positional encoding of a number. You want an incrementing number for
the ID because some index types (b-trees, for instance) deal with
monotonically increasing keys very, very well. If your numbers are
spread out like a shotgun blast, with clusters here and there, the
index won't do nearly as well.
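To make that concrete: the standard library will round-trip an integer
through its byte-string form directly, and a plain counter gives you
the kind of densely packed keys an index likes. A small sketch (the
values are arbitrary examples):

    import itertools

    # A byte string already *is* a positional number; no hand-rolled
    # base-255 encoding is needed for the round trip.
    n = 123456789
    as_bytes = n.to_bytes((n.bit_length() + 7) // 8, "big")
    assert int.from_bytes(as_bytes, "big") == n

    # An incrementing counter keeps new IDs adjacent, which is the
    # access pattern b-tree indexes handle best.
    next_id = itertools.count(1)
    ids = [next(next_id) for _ in range(5)]
    print(as_bytes, ids)   # b'\x07[\xcd\x15' [1, 2, 3, 4, 5]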

About using many tables: The best answer you are going to get is a
bunch of files, one per word, each containing the names or IDs of the
files that contain that word. You can store these in a single
directory, or (as you'll find out to be more efficient) in a tree of
directories; see the sketch below. But this kind of layout is useless
for reverse lookups: given a file, there is no cheap way to list its
words. If you really do have that much data, start using a real
database that is designed to handle it efficiently. The overhead of SQL
parsing and network latency soon gets dwarfed by lookup times anyway.
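
For what the one-file-per-word layout looks like in practice, here is a
minimal sketch; the index directory, the two-character bucketing
scheme, and the helper names are all assumptions for illustration:

    import os

    INDEX_DIR = "wordindex"

    def word_path(word):
        # Bucket by the first two characters so no single directory
        # grows huge, e.g. wordindex/da/database.
        prefix = (word + "__")[:2]
        return os.path.join(INDEX_DIR, prefix, word)

    def add_occurrence(word, file_id):
        path = word_path(word)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "a") as f:
            f.write("%d\n" % file_id)

    def files_for(word):
        try:
            with open(word_path(word)) as f:
                return [int(line) for line in f]
        except FileNotFoundError:
            return []

    add_occurrence("database", 7)
    add_occurrence("database", 12)
    print(files_for("database"))   # [7, 12]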



