Map lots of words to lots of integers

Thomas thomas at cintra.no
Thu May 4 10:16:33 EDT 2000


On Thu, 4 May 2000 09:28:42 -0400, "Warren Postma"
<embed at geocities.com> wrote:

>>
>> I need a fast way of mapping words to integers. A single word must be
>> able to point to many, *many*, integers. Tried stuff like a dict,
>> words as keys, pointing to a list of integers. This is all fine and
>> nice if the thing is located in memory. I want (or need) to store
>> all of this on disk. And the method must be fast. Thought I could use
>> a Berkeley DB file using words as keys, but what should they point to?
>
>I use the Berkeley DB Version 2.0 - the version that ships with Python 1.5 is
>very old, archaic even.  Hopefully Python 1.6 or later will actually be up
>to date. Look for Robin Dunn's precompiled BSDDB 2.x for Python.
>
>The Berkeley DB is very fast on both insert and retrieve of data, and the key
>can be any string. For my own combination of speed and flexibility, I use a
>marshalled key and a marshalled data tuple, and return the whole row as a
>list. In my database application, I have rows of N columns, arranged with
>the key fields first (I sometimes have more than one field forming my primary
>key), and append the two lists together when I get the row from the database,
>like so:
>
>row = marshal.loads(key) + marshal.loads(bsddb[key])
># unmarshal the key list and append the data list
>
>To store a row, I store it like this:
>
>bsddb[marshal.dumps(row[:keycount])] = marshal.dumps(row[keycount:])
>
>I have wrapped this all in a class, by the way, because this would otherwise
>be UGLY!
>
>Anyways, the row objects I deal with are lists, with key fields followed by
>data fields, and my row object has an attribute called keycount allowing me
>to split the rows for most efficient storage. Storing the key data as part
>of the rest of the row is redundant and wasteful. Here's a sample row:
>
>['key field 1', 'key field 2', 1, 2, 3, 4, 5, ...]
>
>This assumes the key is composed of a list of two items, followed by N data
>values of any primitive Python type.
>
>I think marshalling is a wee bit faster than cPickle, if all you need is a
>flat array of simple Python types, no class instances, etcetera. I chose to
>do it this way rather than use dbShelve, because I found pickling wasteful
>and too large for my embedded system, which is constrained for storage
>space. If the right side of your bsddb is always going to be integers, you
>could use something else on the right side instead of marshal, such as
>xstruct, or some other binary array manipulation.
>
>If you want some sample code, feel free to email me and I'll send it to you.
>
>Warren
>

That looks like my last approach. I used dbhash and stored stuff like

db[key] = [(321, 32, 473), (54, 543, 34), (543543, 5453, 43)]

Each tuple forms an id. 
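
If I understand your scheme right, it could be wrapped up roughly like
this (just a sketch to check my understanding; the class and method
names are mine, not your actual code):

import marshal
import bsddb

class RowStore:
    def __init__(self, path, keycount):
        # keycount = number of leading key fields in each row
        self.db = bsddb.hashopen(path, 'c')
        self.keycount = keycount

    def store(self, row):
        # The leading key fields become the marshalled key,
        # the remaining fields the marshalled data.
        self.db[marshal.dumps(row[:self.keycount])] = \
            marshal.dumps(row[self.keycount:])

    def fetch(self, keyfields):
        # Rebuild the full row: key fields followed by data fields.
        return keyfields + marshal.loads(self.db[marshal.dumps(keyfields)])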

1. Would it make any sense to pickle/marshal the list before storing
it? (A sketch follows below.)
2. What if the list is HUGE? It's read from disk, but it still has to
go into memory. How many ids can I store like this, using tuples of
integers, before we're talking about gigabytes of data? And will the
size of the data affect the speed of lookups? I mean, when the list of
tuples/integers is several megabytes, if not hundreds of megabytes,
won't the damn thing eventually crash and burn?
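
On 1, I suppose I have to serialize the list no matter what, since the
values in a dbhash file must be plain strings; marshal should do for
flat tuples of integers (or, as you suggest, some binary packing, since
it's all integers). Roughly like this (the file and key names are just
examples):

import marshal
import dbhash

db = dbhash.open('index.db', 'c')

ids = [(321, 32, 473), (54, 543, 34), (543543, 5453, 43)]
db['word'] = marshal.dumps(ids)    # serialize the list before storing

back = marshal.loads(db['word'])   # the whole list still comes back into memory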

In my application I need to intersect several lists returned from
lookups in the database, so if they're all huge I might run out of
memory.
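
The intersection itself I can do with plain dictionaries, something
like this (a sketch; it assumes every id list fits in memory at once,
which is exactly what worries me):

def intersect(lists):
    # Keep only the ids that occur in every one of the lists.
    result = {}
    for ident in lists[0]:
        result[ident] = 1
    for other in lists[1:]:
        seen = {}
        for ident in other:
            if ident in result:
                seen[ident] = 1
        result = seen
    return result.keys()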

I'm indexing files on CD-ROMs, and the number of ids can run into the
millions.

Thanks for your input.

Thomas


