Dictionary/Hash question

Gabriel Genellina gagsl-py at yahoo.com.ar
Tue Feb 6 23:14:34 EST 2007


On Wed, 07 Feb 2007 00:28:31 -0300, Sick Monkey <sickcodemonkey at gmail.com>
wrote:

> qualm after qualm.  Before you read this, my OS is Linux, up2date, and
> minimal RAM (512).
And Python 2.3 or earlier, I presume; otherwise you would have the builtin
set type.

> The files that my script needs to read in and interpret can contain
> anywhere from 5 million lines to 65 million lines
>
> I have attached 2 versions of code for you to analyze.
> =================
> I am having issues with performance.
>
> Instance 1:  dict_compare.py {which is attached}
> Is awesome, in that I have read a file and stored it into a hash table,
> but if you run it, the program decides to stall after writing all of the
> data.
> <NOTE:  once you receive the statement "finished comparing 2 lists." the
> file has actually finished processing within 1 minute, but the script
> continues to run for additional minutes (10 additional minutes actually).
> <I dont know why>

This version reads both files FULLY into memory; the delay you see at the
end may be the deallocation of those two huge lists.

> Instance 2: dictNew.py
> Runs great but it is a little slower than Instance 1 (dict_compare.py).
> BUT WHEN IT FINISHES, IT STOPS THE APPLICATION.... no additional
> minutes.....
> <NOTE: I was not yelling with the capitalization, but I am frustrated>

This version processes both files one line at a time, so the memory
requirements are a lot lower.
I think it's a bit slower because the Set class is implemented in Python;
since Python 2.4, set is a builtin type.
You could combine both versions: use the dict approach from version 1, and
process one line at a time as in version 2.
You can get the mails that appear in both dictionaries like this:

# dict1 and dict2 are the dictionaries built from the two files
for key in dict1:
    if key in dict2:
        print(key)
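
Combining both versions could look roughly like the sketch below. It is only an illustration, not the code from the attached scripts: the function names load_keys and common_lines are made up here, it assumes one mail address per line, and it uses the "with" statement, so it needs a newer Python than 2.3. Only the first file is kept in memory (as dict keys); the second file is read one line at a time.

```python
def load_keys(path):
    """Read a file one line at a time and keep each stripped line as a dict key."""
    keys = {}
    with open(path) as f:
        for line in f:
            keys[line.strip()] = True
    return keys


def common_lines(path1, path2):
    """Return the lines of path2 (in file order) that also appear in path1.

    Only path1 is held in memory; path2 is streamed line by line.
    """
    seen = load_keys(path1)
    common = []
    with open(path2) as f:
        for line in f:
            key = line.strip()
            if key in seen:  # dict membership test, no scan of the first file
                common.append(key)
    return common
```

Loading the smaller of the two files into the dict keeps memory use down. Note that a line repeated in the second file is reported once per occurrence.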

-- 
Gabriel Genellina



