Fuzzy string comparison

Wed Dec 27 16:59:22 EST 2006

Steve Bergman wrote:
> Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
> for.  (It's blazingly fast, too.)
>
> I suspect that if I strip out all the punctuation, etc. from both the
> itemnumber and description columns, as suggested, and concatenate them,
> pairing the record with its closest match in the other file, it ought
> to be pretty accurate.  Obviously, the final decision will be up to a
> human being, but this should help them quite a bit.
>
> BTW, excluding all the items that match exactly, I only have 8000 items
> in one file to compare to 2600 in the other.  As fast as
> python-levenshtein seems to be, this should finish in well under a
> minute.

The above suggests that you plan to do a preliminary pass using exact
comparison, and remove exact-matching pairs from further consideration.
If that is the case, here are a few questions for you to ponder:

What about 7890123 in file A and 789o123 in file B? Are you concerned
about standardising your item-numbers?
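
As a rough illustration of what standardising could look like before the exact-comparison pass, here is a minimal sketch. The function name standardise and the CONFUSABLE table are hypothetical, and folding letters into digits is only safe on the assumption that real item-numbers never contain those letters legitimately.

```python
import re

# Hypothetical table of look-alike characters. This is an assumption about
# the data: it is only appropriate if item-numbers are meant to be numeric.
CONFUSABLE = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1"})

def standardise(item_number):
    """Return a canonical form of an item-number for exact comparison."""
    # Drop punctuation, spaces and anything else that is not a letter or digit.
    stripped = re.sub(r"[^0-9A-Za-z]", "", item_number)
    # Fold look-alike letters into digits, then normalise case.
    return stripped.translate(CONFUSABLE).upper()

if __name__ == "__main__":
    for raw in ("7890123", "789o123", "78-901-23"):
        print(raw, "->", standardise(raw))
```

With something like this applied to both files first, the three spellings above compare equal, so the exact pass catches them and the fuzzy pass has less to do.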

What about cases like 7890123 and 789o123 in file A? Are you concerned
about duplicated records within a file?

What about cases like 7890123 and 789o123 in file A and 7890123 and
789o123 and 78-901-23 in file B? Are you concerned about grouping all
instances of the same item?
If you are, the magic phrase you are looking for is "union find".
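
To make "union find" a little more concrete, here is a minimal sketch that groups item-numbers whose hyphen-stripped forms are within a small edit distance of each other. Everything in it is illustrative: edit_distance is a plain pure-Python Levenshtein so the example has no dependencies, group_items and max_distance are hypothetical names, and the all-pairs loop is only reasonable for small inputs.

```python
def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


class UnionFind:
    """Disjoint sets over the integers 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Walk up to the root, shortening the path as we go (path halving).
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x != root_y:
            self.parent[root_y] = root_x


def group_items(items, max_distance=1):
    """Group items that are linked, directly or via a chain, by small distance."""
    keys = [s.replace("-", "") for s in items]  # assumption: hyphens are noise
    uf = UnionFind(len(items))
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if edit_distance(keys[i], keys[j]) <= max_distance:
                uf.union(i, j)
    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(uf.find(i), []).append(item)
    return list(groups.values())


if __name__ == "__main__":
    print(group_items(["7890123", "789o123", "78-901-23", "5550001"]))
```

The point of union find here is that the grouping is transitive: if A is close to B and B is close to C, all three land in one group even when A and C were never judged close to each other directly. Whether that is what you want depends on the data, since a long chain of near-misses can merge items that are really different.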

HTH,
John