difflib qualm

Wed Jan 24 21:26:26 EST 2007

At Wednesday 24/1/2007 23:05, Sick Monkey wrote:

>I am trying to write a Python script that will compare 2 files which 
>contain names (millions of them).
>
>More specifically, I have 2 files (Files1.txt and 
>Files2.txt).  Files1.txt contains 180 thousand names and Files2.txt 
>contains 34 million names.
>
>I have a script which will analyze these two files and store them 
>into 2 different lists (fileList1 and fileList2 respectively).  I 
>have imported the difflib library and, after the lists are created, 
>I match on the following criteria " " for difflib -> (just the names 
>that are similar between the two files).
>
>This works perfectly for hundreds of names but is taking forever for 
>millions of them, so it is not really efficient.
>
>Does anyone have any idea how to make this more 
>efficient?  (in terms of time and RAM)
>
>Any advice would be greatly appreciated.   (NOTE:  I have been 
>trying to study multithreading, but have not really grasped the 
>concept.  So I may need some examples.)

When you say "names", do you mean people's names? So you want to match, 
say, Levenshtein Vladimir to Lebenstain V.? And you only have the 
names to match?
That is not a good candidate for difflib. See this paper:
Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital 
Government Web Service (2003)
Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed K. Elmagarmid, Thanaa 
M. Ghanem, Ahmed R. Huwait.
http://citeseer.ist.psu.edu/elfeky03record.html
and you can get other pointers from the Levenshtein distance article 
in Wikipedia and http://en.wikipedia.org/wiki/Fuzzy_string_searching
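
To make the edit-distance idea concrete, here is a minimal sketch in Python. It is not taken from the paper above; the function names (levenshtein, similar_names), the max_distance parameter, and the "same first letter" blocking rule are all assumptions chosen for illustration. Blocking only cuts down the number of pairs compared, and it will miss matches whose first letters differ.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar_names(small, large, max_distance=2):
    """Yield (name, candidate, distance) for pairs within max_distance.

    Assumed blocking rule (hypothetical): only names that share a first
    letter and differ in length by at most max_distance are compared.
    """
    # Index the smaller collection in memory, keyed by first letter.
    blocks = defaultdict(list)
    for name in small:
        blocks[name[:1].lower()].append(name)
    # Stream the larger collection; it can be any iterable, e.g. a file.
    for candidate in large:
        for name in blocks.get(candidate[:1].lower(), ()):
            if abs(len(name) - len(candidate)) > max_distance:
                continue  # the distance cannot be within the threshold
            distance = levenshtein(name.lower(), candidate.lower())
            if distance <= max_distance:
                yield name, candidate, distance
```

Because the second argument is only iterated once, the 34 million names could be read line by line from Files2.txt rather than held in a list, which helps with RAM; only the 180 thousand names need to be indexed in memory.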

-- 
Gabriel Genellina
Softlab SRL 
