[Tutor] dictionaries and memory handling

Arild B. Næss arildna at stud.ntnu.no
Fri Feb 23 18:30:40 CET 2007


Hi,

I'm working on a Python script for a task in statistical language
processing. Briefly put, it all boils down to counting different
things in very large text files, doing simple computations on these
counts and storing the results. I have been using Python's dictionary
type as my basic data structure for storing the counts. This has been
a nice and simple solution, but it turns out to be a bad idea in the
long run, since the dictionaries become _very_ large, and the script
fails with a MemoryError when I run it on texts above a certain size.
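
To make the setup concrete, here is a minimal sketch of the kind of
counting described above. It is an illustration only, not the actual
script: the function name count_words is made up, tokens are assumed
to be whitespace-separated, and collections.Counter (a dict subclass)
stands in for a plain dictionary. The input is read line by line, so
only the counts, not the text, have to fit in memory.

```python
from collections import Counter
import io


def count_words(lines):
    """Count whitespace-separated tokens from an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Only the current line is held in memory, not the whole text.
        counts.update(line.split())
    return counts


# A small in-memory stand-in for a very large text file; a real run
# would pass an open file object instead.
sample = io.StringIO("the cat sat\nthe cat ran\n")
counts = count_words(sample)
print(counts.most_common(2))
```

Even with streaming input, the counter itself still grows with the
number of distinct keys, which is where the memory problem comes from.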

It seems that an SQL database would probably be the way to go, but I  
am a bit concerned about speed issues (even though running time is  
not all that crucial here). In any case it would probably take me a  
while to get a database up and running and I need to hand in some  
preliminary results pretty soon, so for now I think I'll postpone the  
SQL and try to tweak my current script to be able to run it on  
slightly longer texts than it can handle now.

So, enough beating around the bush, my questions are:

- Will the dictionaries take up less memory if I use numbers rather  
than words as keys (i.e. will {3:45, 6:77, 9:33} consume less memory  
than {"eloquent":45, "helpless":77, "samaritan":33} )? And if so:  
Slightly less, or substantially less memory?

- What are common methods to monitor the memory usage of a script?  
Can I add a snippet to the code that prints out how many MBs of  
memory a certain dictionary takes up at that particular time?
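
Both questions can be explored with a short measurement script. The
sketch below assumes a Python recent enough to have sys.getsizeof and
the tracemalloc module. The helper approx_dict_bytes is a made-up
name, and its result is only a rough estimate for a flat dictionary:
sys.getsizeof on a dict reports the dict's own table, so the keys and
values are added separately. Exact numbers depend on the interpreter
version and platform.

```python
import sys
import tracemalloc


def approx_dict_bytes(d):
    """Rough size of a flat dict: its table plus each key and value.

    Objects referenced more than once are counted a single time,
    identified by id().
    """
    seen = set()
    total = sys.getsizeof(d)
    for key, value in d.items():
        for obj in (key, value):
            if id(obj) not in seen:
                seen.add(id(obj))
                total += sys.getsizeof(obj)
    return total


by_number = {3: 45, 6: 77, 9: 33}
by_word = {"eloquent": 45, "helpless": 77, "samaritan": 33}

print("int keys:", approx_dict_bytes(by_number), "bytes")
print("str keys:", approx_dict_bytes(by_word), "bytes")

# tracemalloc reports memory allocated by Python since start().
tracemalloc.start()
counts = {}
for i in range(100000):
    counts["word%d" % i] = i
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print("current: %.1f MB, peak: %.1f MB" % (current / 1e6, peak / 1e6))
```

Note that replacing words with numbers only saves memory if the
word-to-number mapping itself does not have to be kept in memory as
well.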

regards,
Arild Næss

