Map lots of words to lots of integers

Warren Postma embed at geocities.com
Thu May 4 09:28:42 EDT 2000


>
> I need a fast way of mapping words to integers. A single word must be
> able to point to many, *many*, integers. Tried stuff like a dict,
> words as keys, pointing to a list of integers. This is all fine and
> nice if the thing is located in memory. I want to (or need ) to store
> all of this on disk. And the method must be fast. Thought I could use
> a Berkley DB file using words as keys, but what should they point to?

I use the Berkely DB Version 2.0 - the version that ships with Python 1.5 is
very old, archaic even.  Hopefully Python 1.6 or later will actually be up
to date. Look for Robin Dunn's precompiled BSDDB 2.x for Python.

The berkeley db is very fast on both insert and retrieve of data, the key
can be any string. For my own combination of speed and flexibility, I use a
marshalled key and marshalled data tuple, and return the whole row as a
list. In my database application, I have rows of N columns, arranged with N
key fields first (I sometimes have more than one field forming my primary
key) and append the lists together when I get the row from the database like
so:

row = marshal.loads( key) + marshal.loads( bsddb[key] ) # unmarshal key list
and append the data list.

To store a row, I store it like this:

bsddb[ marshal.dumps( row[1:keycount] ) ] = marshal.dumps( row[keycount:] )

I have wrapped this all in a class, by the way, because this would otherwise
be UGLY!

Anyways, the row objects I deal with are lists, with key fields followed by
data fields, and my row object has an attribute called keycount allowing me
to split the rows for most efficient storage. Storing the key data as part
of the rest of the row is redundant and wasteful. here's a sample row:

[ 'key field 1', 'key field 2',  1,2,3,4,5,.... ]

Assuming the key was composed of a list of two items, followed by N data
values of any primitive Python type.

I think Marshalling is a wee bit faster than cPickle, if all you need is a
flat array of simple Python types, no class instances, etcetera. I chose to
do it this way rather than use dbShelve, because I found pickling wasteful
and too large for my embedded system, which is constricted for storage
space.  If the right side of your bsddb is always going to be integers, you
could use something else on the right side instead of marshal, such as
xstruct, or some other binary array manipulation.

If you want some sample code, feel free to email me and I'll send it to you.

Warren





More information about the Python-list mailing list