Fuzzy string comparison

John Machin sjmachin at lexicon.net
Wed Dec 27 16:59:22 EST 2006


Steve Bergman wrote:
> Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
> for.  (It's blazingly fast, too.)
>
> I suspect that if I strip out all the punctuation, etc. from both the
> itemnumber and description columns, as suggested, and concatenate them,
> pairing the record with its closest match in the other file, it ought
> to be pretty accurate.  Obviously, the final decision will be up to a
> human being, but this should help them quite a bit.
>
> BTW, excluding all the items that match exactly, I only have 8000 items
> in one file to compare to 2600 in the other.  As fast as
> python-levenshtein seems to be, this should finish in well under a
> minute.

The above suggests that you plan to do a preliminary pass using exact
comparison, and remove exact-matching pairs from further consideration.
If that is the case, here are a few questions for you to ponder:

What about 789o123 in file A and 789o123 in file B? Are you concerned
about standardising your item-numbers?
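
A minimal sketch of what standardising could look like, assuming (and
these are my assumptions, not facts about your data) that a letter O in
an item-number is always a mistyped zero and that punctuation and
spaces carry no meaning. The function name is made up for illustration:

```python
import re

def normalise_item_number(raw):
    """Return a canonical form of an item-number (hypothetical rules)."""
    # Assumption: letter O/o is a mistyped digit zero.
    s = raw.strip().upper().replace("O", "0")
    # Assumption: anything other than digits and letters is noise.
    return re.sub(r"[^0-9A-Z]", "", s)

print(normalise_item_number("789o123"))    # 7890123
print(normalise_item_number("78-901-23"))  # 7890123
```

If some of your item-numbers legitimately contain the letter O, a rule
like this would corrupt them, so check it against the real data first.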

What about cases like 7890123 and 789o123 in file A? Are you concerned
about duplicated records within a file?

What about cases like 7890123 and 789o123 in file A and 7890123 and
789o123 and 78-901-23 in file B? Are you concerned about grouping all
instances of the same item?
If you are, the magic phrase you are looking for is "union find".
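
A rough sketch of the idea, assuming you already have a list of
(item, item) pairs that some earlier pass (exact, fuzzy, or human)
judged to be the same item. The function names and the sample pairs
are invented for illustration:

```python
def find(parent, x):
    """Return the representative (root) of the group containing x."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Path compression: point every node on the path straight at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(parent, a, b):
    """Merge the groups containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def group_matches(pairs):
    """Turn pairwise 'same item' judgements into groups of items."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        union(parent, a, b)
    groups = {}
    for item in parent:
        groups.setdefault(find(parent, item), []).append(item)
    return list(groups.values())

# Hypothetical matched pairs; the second group is made-up data.
pairs = [("7890123", "789o123"),
         ("789o123", "78-901-23"),
         ("A100", "A-100")]
for group in group_matches(pairs):
    print(sorted(group))
```

Note that 7890123 and 78-901-23 end up in the same group even though
no pair links them directly; that transitivity is the whole point.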

HTH,
John



