Scalable python dict {'key_is_a_string': [count, some_val]}

geremy condra debatem1 at gmail.com
Wed Mar 10 13:49:59 EST 2010


On Wed, Mar 10, 2010 at 11:47 AM, Krishna K <krishna.k.0001 at gmail.com> wrote:
>
>
> On Fri, Feb 19, 2010 at 11:27 PM, Jonathan Gardner
> <jgardner at jonathangardner.net> wrote:
>>
>> On Fri, Feb 19, 2010 at 10:36 PM, krishna <krishna.k.0001 at gmail.com>
>> wrote:
>> > I have to manage a couple of dicts holding a huge dataset (larger
>> > than will fit in memory on my system). Each key is a string
>> > (actually a tuple converted to a string) and each value is a
>> > two-item list, one element of which is a count related to the key.
>> > At the end I have to sort each dictionary by that count.
>> >
>> > The platform is Linux. I am planning to set a threshold beyond
>> > which I write the data out to files (3 columns: 'key count
>> > some_val') and later merge those files. I plan to sort the
>> > individual files by the key column, walk through them with one
>> > pointer per file, and add up the counts when entries from two files
>> > match by key. The sorting would be done with the 'sort' command, so
>> > the bottleneck is the 'sort' command.
>> >
>> > Any suggestions, comments?
>> >
>>
>> You should be using BDBs or even something like PostgreSQL. The
>> indexes there will give you the scalability you need. I doubt you will
>> be able to write anything that will select, update, insert or delete
>> data better than what BDBs and PostgreSQL can give you.
>>
>> --
>> Jonathan Gardner
>> jgardner at jonathangardner.net
>
> Thank you. I tried BDB, but it seems to get very slow as the data scales.
>
> Thank you,
> Krishna

Have you tried any of the big key-value store systems, like CouchDB?
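
For comparison, the spill-and-merge plan from the original post can be
sketched in pure Python roughly as below. This is only a sketch under
assumptions, not a tested solution for the real data: the function names
(spill, read_run, merge_runs, count_records) are made up for illustration,
the files are tab-separated (so keys must not contain tabs or newlines),
each input record is assumed to be a (key, some_val) pair that adds 1 to
the count, and when the same key shows up in several files the last
some_val seen is kept, since the post does not say how some_val should be
combined.

```python
import heapq
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def spill(counts, directory):
    """Write one in-memory dict to a key-sorted run file; return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        for key in sorted(counts):
            count, some_val = counts[key]
            f.write(f"{key}\t{count}\t{some_val}\n")
    return path


def read_run(path):
    """Yield (key, count, some_val) tuples from one run file, in key order."""
    with open(path) as f:
        for line in f:
            key, count, some_val = line.rstrip("\n").split("\t")
            yield key, int(count), some_val


def merge_runs(paths):
    """Merge key-sorted run files, summing the counts of equal keys."""
    # heapq.merge keeps one "pointer" per file and reads lazily.
    merged = heapq.merge(*(read_run(p) for p in paths), key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        rows = list(group)  # at most one row per run file for a given key
        # Assumption: keep the some_val from the last row in the group.
        yield key, sum(row[1] for row in rows), rows[-1][2]


def count_records(records, directory, threshold=1_000_000):
    """Count keys, spilling the dict to disk whenever it reaches threshold."""
    counts, paths = {}, []
    for key, some_val in records:
        entry = counts.setdefault(key, [0, some_val])
        entry[0] += 1
        entry[1] = some_val
        if len(counts) >= threshold:
            paths.append(spill(counts, directory))
            counts = {}
    if counts:
        paths.append(spill(counts, directory))
    return merge_runs(paths)
```

The merged stream comes out ordered by key, not by count, so the final
ordering by count is still a separate step: sorted() with a count key if
the merged result fits in memory, otherwise writing it to a file and
handing that to an external sort. Memory use during the merge is roughly
one row per run file, and the number of files open at once equals the
number of runs, which may matter if the threshold is small.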

Geremy Condra


