difflib qualm

Thu Jan 25 19:49:53 EST 2007

Gabriel Genellina wrote:
> At Wednesday 24/1/2007 23:05, Sick Monkey wrote:
> 
>> I am trying to write a python script that will compare 2 files which
>> contains names (millions of them).
>>
>> More specifically, I have 2 files (Files1.txt and Files2.txt). 
>> Files1.txt contains 180 thousand names and Files2.txt contains 34
>> million names.
>>
>> I have a script which will analyze these two files and store them into
>> 2 different lists (fileList1 and fileList2 respectivly).  I have
>> imported the diflib library and after the lists are created, matching
>> on the following criteria " " for diflib -> (just the names that are
>> similar between the two files).
>>
>> This works perfectly for hundreds of names but is taking forever for
>> millions of them; thus not really efficient.
>>
>> Does anyone have any idea on how to get this more efficient? 
>> (speaking of Time and RAM)
>>
>> Any advice would be greatly appreciated.   (NOTE:  I have been trying
>> to study multithreading, but have not really grasp the concept.  So I
>> may need some examples.)
> 
> When you say "names" you mean people's names? So you want to match, say,
> Levenshtein Vladimir to Lebenstain V.? And you only have the names to
> match?
> Not a good candidate for difflib. See this paper:
> Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital
> Government Web Service (2003)
> Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed K. Elmagarmid, Thanaa M.
> Ghanem, Ahmed R. Huwait.
> http://citeseer.ist.psu.edu/elfeky03record.html
> and you can get other pointers from the Levenshtein distance article in
> Wikipedia and http://en.wikipedia.org/wiki/Fuzzy_string_searching
> 
> 

One approach:

Put the big list of names in a database and create soundex keys for the names
and make the soundex keys an index so you can search quickly.  Databases
are really good at storing data that is searchable via an index.  If you REALLY
need speed you can consider an in-memory database.

Create soundex keys for each name in your small list and query the database
with this key into the table in the DB that is indexed on soundex keys.
If you get a hit, the key is sufficiently "alike" to be a candidate.  I'll
leave the remainder to you.  Perhaps there is other information that will
help determine if there is a match?

-Larry Bates