Storing pairs of (int, int) in a database: which db to choose?

Paul Rubin
Tue Dec 23 07:59:57 EST 2003


andreif at mail.dntis.ro (Stormbringer) writes:
> I want to implement a fulltext search for messages in a forum. More
> exactly for each message I store pairs (wordId, msgId) for each
> identified word and when I search something I want to be able to
> retrieve very quickly all msgId for a given wordId.

Using any kind of database in a big message system may thrash your
disk pretty badly if you update the database every time a new message
is entered.  That means making hundreds of new database entries, one
for each word in the message, each entry needing a disk seek.  If
you're adding a few new messages a second, it will be hard for your
disk to keep up.  

Anyway you don't need a real database.  You're not doing that kind of
access, you don't need multi-phase transactions, blah blah blah.  I
think the way to do it instead is to keep several tables (there's a
rough Python sketch after the list):

  1) a static table on disk, simply (wordId, msgId) pairs of 4-byte
     binary ints, sorted on wordId.  You can also have an index for
     it: a table with one entry for each wordId, saying where in
     the big table the msgId's for that word start.

  2) an update log on disk, (wordId, msgId) pairs like above, but
     unsorted.

  3) an in-memory table (Python dict) containing the same pairs as
     the update log, but using a dict structure for instant lookup.
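
Here's a rough sketch, in present-day Python, of what table 1) and
its index might look like.  The record layout (two unsigned 4-byte
ints per pair, little-endian) and the names are just assumptions to
make the idea concrete:

    import struct

    # Each record in the static table: (wordId, msgId).
    PAIR = struct.Struct("<II")

    def read_pair(f):
        """Read one (wordId, msgId) record, or return None at EOF."""
        data = f.read(PAIR.size)
        return PAIR.unpack(data) if data else None

    def build_index(path):
        """One sequential scan over the sorted static table,
        recording for each wordId the byte offset where its run
        starts and how many msgId's it has."""
        index = {}
        with open(path, "rb") as f:
            offset = 0
            while True:
                pair = read_pair(f)
                if pair is None:
                    break
                word_id, _ = pair
                entry = index.setdefault(word_id, [offset, 0])
                entry[1] += 1            # one more msgId in this run
                offset += PAIR.size
        return index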

When a new message comes in, you just append all its (wordId, msgId)
pairs to the update log (a pure append, no random seeking needed), and
also update the in-memory table.  Actually you don't need to store
the full (wordId, msgId) pair for every word.  Just append a record
that says "the next nnn records are for msgId xxx" and then store
only the wordId's.

When a search query comes in, look in the in-memory table by hashing,
and if the word isn't found, look in the static table by binary search.
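
The lookup could go something like this (again assuming the 4-byte
record layout above; the binary search exploits the fixed record
width to seek straight to any record by number):

    import struct

    PAIR = struct.Struct("<II")

    def find_msg_ids(word_id, mem_table, static_path):
        """Hash lookup in the in-memory table; if the word isn't
        there, binary-search the sorted static table instead."""
        if word_id in mem_table:
            return list(mem_table[word_id])
        hits = []
        with open(static_path, "rb") as f:
            f.seek(0, 2)
            n = f.tell() // PAIR.size      # records on disk
            lo, hi = 0, n
            while lo < hi:                 # leftmost record >= word_id
                probe = (lo + hi) // 2
                f.seek(probe * PAIR.size)
                wid, _ = PAIR.unpack(f.read(PAIR.size))
                if wid < word_id:
                    lo = probe + 1
                else:
                    hi = probe
            f.seek(lo * PAIR.size)
            for _ in range(lo, n):         # collect the run of matches
                wid, msg = PAIR.unpack(f.read(PAIR.size))
                if wid != word_id:
                    break
                hits.append(msg)
        return hits

(In practice you'd probably want to search both tables and merge the
results, since a word can have postings in the static table and in
today's log at the same time.)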

Then once a night, update the static table by running the update log
through a sort utility and merging the result with the static table.
After that you can delete the update log and in-memory table and
start them fresh.  The sort-merge phase is relatively fast because
sorting utilities are written to sling around large buffers,
minimizing disk head motion.
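
The nightly rebuild might look something like this.  For simplicity
the day's log entries are assumed already expanded back into
(wordId, msgId) tuples, e.g. flattened out of the in-memory table:

    import heapq, os, struct

    PAIR = struct.Struct("<II")

    def iter_pairs(path):
        """Yield (wordId, msgId) records from a fixed-width file."""
        with open(path, "rb") as f:
            while True:
                data = f.read(PAIR.size)
                if not data:
                    break
                yield PAIR.unpack(data)

    def nightly_merge(static_path, log_pairs):
        """Sort the day's pairs, merge them with the already sorted
        static table, and atomically swap in the new table."""
        log_pairs = sorted(log_pairs)        # the "sort utility" step
        tmp = static_path + ".new"
        with open(tmp, "wb") as out:
            merged = heapq.merge(iter_pairs(static_path), log_pairs)
            for pair in merged:
                out.write(PAIR.pack(*pair))
        os.replace(tmp, static_path)         # new table goes live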

If the system crashes or you stop the server for some reason, on
restart you can simply read the update log into memory.
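
A sketch of that recovery step, assuming the same log record layout
as in the append sketch above:

    import struct

    UINT = struct.Struct("<I")

    def replay_log(path):
        """Rebuild the in-memory table by replaying the update log:
        each record is a count, a msgId, then count wordId's."""
        mem_table = {}
        with open(path, "rb") as f:
            while True:
                header = f.read(UINT.size)
                if not header:
                    break
                (count,) = UINT.unpack(header)
                (msg_id,) = UINT.unpack(f.read(UINT.size))
                for _ in range(count):
                    (wid,) = UINT.unpack(f.read(UINT.size))
                    mem_table.setdefault(wid, []).append(msg_id)
        return mem_table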

You probably don't even need the in-memory structure if you're not
taking too many search queries--the update log should be small enough
that you can just sequentially scan it when a query comes in.

The main idea is that you replace a lot of random small updates with
occasional big sorting operations.  The big sorts are far more
i/o-efficient than the small updates.  These are the methods they
always used in the old days, when even random-access disk space was
very expensive, so big data sets were always stored on completely
sequential media (magnetic tape).



