[Tutor] managing memory large dictionaries in python

emile emile at fenx.com
Wed Oct 17 00:21:43 CEST 2012


On 10/16/2012 01:03 PM, Prasad, Ramit wrote:
> Abhishek Pratap wrote:
>> Sent: Tuesday, October 16, 2012 11:57 AM
>> To: tutor at python.org
>> Subject: [Tutor] managing memory large dictionaries in python
>>
>> Hi Guys
>>
>> For my problem I need to store 400-800 million 20-character keys in a
>> dictionary and do counting. This data structure takes about 60-100 GB
>> of RAM.
>> I am wondering if there are slick ways to map the dictionary to a file
>> on disk and not store it in memory, but still access it as a dictionary
>> object. Speed is not the main concern in this problem, and persistence
>> is not needed, as the counting will only be done once on the data. We
>> want the script to run on smaller-memory machines if possible.
>>
>> I did think about databases for this, but intuitively it looks like
>> overkill, because for each key you have to first check whether it is
>> already present and increase the count by 1, and if not, insert
>> the key into the database.
>>
>> Just want to take your opinion on this.
>>
>> Thanks!
>> -Abhi
>
> I do not think that a database would be overkill for this type of task.

Agreed.

> Your process may be trivial, but the amount of data it has to manage is
> not trivial. You can use a simple database like SQLite. Otherwise, you
> could create a file for each key and update the count in there. It will
> run on a small amount of memory but will be slower than using a db.
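
The SQLite route is only a few lines with the standard library's sqlite3
module.  Something like this untested sketch (the input and database file
names are made up; one key per line is assumed):

    import sqlite3

    conn = sqlite3.connect("counts.db")   # on-disk database, not RAM
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (key TEXT PRIMARY KEY, n INTEGER)")

    cur = conn.cursor()
    with open("keys.txt") as src:         # hypothetical input file
        for i, line in enumerate(src):
            key = line.strip()
            # INSERT OR IGNORE followed by UPDATE replaces the explicit
            # "check whether it is already present" step Abhi worried about.
            cur.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (key,))
            cur.execute("UPDATE counts SET n = n + 1 WHERE key = ?", (key,))
            if i % 100000 == 0:
                conn.commit()             # commit in batches, not per row
    conn.commit()
    conn.close()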

As for the one-file-per-key idea: well, maybe -- it depends on how many 
unique entries exist.  Most vanilla systems are going to crash (or give 
the appearance thereof) if you end up with millions of file entries in a 
single directory.  If a filesystem-based answer is sought, I'd generate 
a 16-bit CRC per key, append each key to a file named after its CRC, 
then process those 65536 bucket files one at a time: sort each and do 
the final counting.  A sketch follows.
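
Roughly like this, untested (file and directory names are invented; 
zlib.crc32 truncated to 16 bits stands in for a real 16-bit CRC, and an 
in-memory dict per bucket stands in for the sort-and-count step):

    import os
    import zlib
    from collections import Counter

    BUCKET_DIR = "buckets"             # hypothetical scratch directory
    os.makedirs(BUCKET_DIR, exist_ok=True)

    # Pass 1: scatter keys into at most 65536 bucket files named by CRC.
    with open("keys.txt") as src:      # hypothetical input: one key per line
        for line in src:
            key = line.strip()
            crc = zlib.crc32(key.encode()) & 0xFFFF
            # Reopening per key is slow; batching the writes would help,
            # but speed isn't the main concern here.
            with open(os.path.join(BUCKET_DIR, "%04x" % crc), "a") as bucket:
                bucket.write(key + "\n")

    # Pass 2: each bucket holds roughly 1/65536 of the data, so counting
    # it entirely in memory is cheap.
    with open("counts.txt", "w") as out:
        for name in sorted(os.listdir(BUCKET_DIR)):
            counts = Counter()
            with open(os.path.join(BUCKET_DIR, name)) as bucket:
                for line in bucket:
                    counts[line.strip()] += 1
            for key, n in counts.items():
                out.write("%s\t%d\n" % (key, n))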

Emile


