very large dictionary

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Aug 2 02:54:17 EDT 2008


On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:

> Hello,
> 
> I tried to load a 6.8G large dictionary on a server that has 128G of
> memory. I got a memory error. I used Python 2.5.2. How can I load my
> data?

How do you know the dictionary takes 6.8G?

I'm going to guess an answer to my own question. In a later post, Simon 
wrote:

[quote]
I had a file bigrams.py with a content like below:

bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
...

}
[end quote]


I'm guessing that the file is 6.8G of *text*. How much memory will it 
take to import that? I don't know, but probably a lot more than 6.8G. The 
compiler has to read the whole file in one giant piece, parse it, 
create all the string and int objects, and only then can it build the 
dict. By my back-of-the-envelope calculation, the pointers alone will 
require about 5GB, never mind the objects they point to.
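
To see roughly where that 5GB figure comes from, here is a 
back-of-the-envelope sketch (the average line length is a guess, since 
the actual number of entries isn't given in the thread):

entries = 6.8e9 / 22        # ~300 million pairs, if a line averages ~22 bytes
pointers = entries * 2 * 8  # key pointer + value pointer, 8 bytes each (64-bit)
print pointers / 1e9        # about 5 GB, before the str and int objects themselves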

I suggest storing your data as data, not as Python code. Create a 
text file "bigrams.txt" with one key/value pair per line, like this:

djy : 75
djz : 57
djzoom : 165
dk : 28893
...

Then read it in like this:

bigrams = {}
for line in open('bigrams.txt', 'r'):
    # Split on the last ':' only, in case a key itself contains a colon.
    key, value = line.rsplit(':', 1)
    bigrams[key.strip()] = int(value.strip())


This will be slower, but because it only needs to read the data one line 
at a time, it might succeed where trying to slurp all 6.8G in one piece 
will fail.
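
And if the bigram counts are produced by a script in the first place, 
that script could just as easily write them out in this format instead 
of generating Python source. A sketch, assuming the counts live in an 
ordinary dict at that point and that no key contains a colon:

out = open('bigrams.txt', 'w')
for key, value in bigrams.iteritems():
    out.write('%s : %d\n' % (key, value))
out.close()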



-- 
Steven


