Compare list entry from csv files

Tue Nov 27 10:05:57 EST 2012

On 2012-11-27, Anatoli Hristov <tolidtm at gmail.com> wrote:
> Thanks for your help. I will do my best for the forum :)
>
> I advanced a little bit with the algorithm and at least I can
> now extract and compare the fields :) For my beginner skills I
> think this is too much for me. Now next step is to add the
> second field with the number to the Namelist and copy it to a
> third filename I suppose.

I had to write a similar type of program, and I imagine it's a
common problem. Sometimes new students provide incorrect SSN's or
simply leave them blank. This makes it impossible for us to match
their application for financial aid to their admissions record.

You have to analyze how you're going to match records.

In my case, missing SSN's are one case. A likeley match in this
case is when the names are eerily similar.

In the other case, where they simply got their SSN wrong, I have
to check for both a similar SSN and a similar name.

But you still have to define "similar." I looked up an algorithm
on the web called Levenshtein Distance, and implemented it like
so.

def levenshteindistance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

The algorithm return a count of every difference between the two
strings, from 0 to the length of the longest string.

Python provides difflib, which implements a similar algorithm, so
I used that as well (kinda awkwardly). I used
difflib.get_close_matches to get candidates, and then
difflib.SequenceMatcher to provide me a score measuring the
closeness.

matches = difflib.get_close_matches(s1, s2)
for m in matches:
    scorer = difflib.SequenceMatcher(None, s1, m)
    ratio = scorer.ratio()
    if ratio == 0.0:
        # perfect match
    if ratio > MAX_RATIO: # You gotta choose this. I used 0.1 
        # close match

The two algorithms come up with different guesses, and I pass on
their suggestions for fixes to a human being. Both versions of
the program take roughly 5 minutes to run the comparison on
2000-12000 records between the two files.

I like the results of Levenshtein distance a little better, but
difflib finds some stuff that it misses.

In your case, the name is munged horribly in one of the files so
you'll first have to first sort it out somehow.

-- 
Neil Cerutti