Grouping pairs - suggested tools

Tue Sep 21 14:41:04 EDT 2010

> I think you have the same bug as Alf's code, you never merge existing
> groups. Have you tried Arnaud's counterexample?
>
> By the way, are ('a', 'b') and ('b', 'a') to be considered equivalent for
> your problem?
>
> Peter

Hi Peter,

Yes. I realise that this doesn't take into account existing
relationships/groupings. For the particular situation that I am
looking at, this is unlikely to happen and is not critical. However, I
understand that in general you might want to assign an item to one or
more groups.

In my case ('a','b') and ('b','a') are equivalent. For anyone else
looking at this, I can elaborate on the problem to make it a bit more
concrete.

I have a very long listing of customer details and am trying to clean
the data. In particular I am looking for duplicates:

<< Core Data >>
id, Name, Address
a, Acme Ltd, 1 Main Street
b, Acme Limited, 1 Main St
c, Acme L'td, 1 Main Street
d, Smiths, 22 Upper Road
e, Smyths, 22 Upper Rd
f, Smiths ltd, 22 Upperrd
g, Apple Empire, 222 Lower Way
h, Apple Emp, 222 Lower Way

Obviously this is oversimplified. The actual dataset has thousands of
records. I am using the difflib module to compare each item against
all those below it, and where two items are similar they are stored in
a paired table:

<< Paired Data >>
id1, id2, relationship_strength
a, b, 0.8
a, c, 0.88
b, c, 0.8
d, e, 0.75
d, f, 0.88
e, f, 0.87
g, h, 0.77
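
A minimal sketch of this kind of pairwise comparison is given here for
anyone following along. It is not the exact code used: the function
name similar_pairs, the 0.75 threshold, and the choice to join name and
address into one lowercase string before comparing are all assumptions
made for illustration, so the scores it produces will not necessarily
match the table above.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample records copied from the core data: id -> (name, address)
records = {
    "a": ("Acme Ltd", "1 Main Street"),
    "b": ("Acme Limited", "1 Main St"),
    "c": ("Acme L'td", "1 Main Street"),
    "d": ("Smiths", "22 Upper Road"),
    "e": ("Smyths", "22 Upper Rd"),
    "f": ("Smiths ltd", "22 Upperrd"),
    "g": ("Apple Empire", "222 Lower Way"),
    "h": ("Apple Emp", "222 Lower Way"),
}


def similar_pairs(records, threshold=0.75):
    """Compare each record against all those after it.

    Returns (id1, id2, strength) tuples for pairs whose similarity
    ratio is at or above the (assumed) threshold.
    """
    pairs = []
    # combinations() yields each unordered pair once, so ('a', 'b')
    # is compared but ('b', 'a') is not.
    for (id1, rec1), (id2, rec2) in combinations(records.items(), 2):
        text1 = " ".join(rec1).lower()
        text2 = " ".join(rec2).lower()
        strength = SequenceMatcher(None, text1, text2).ratio()
        if strength >= threshold:
            pairs.append((id1, id2, round(strength, 2)))
    return pairs


if __name__ == "__main__":
    for id1, id2, strength in similar_pairs(records):
        print(id1, id2, strength)
```

Comparing every record against every later one is quadratic in the
number of records, which is worth keeping in mind for a listing with
thousands of rows.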

However, these pairings aren't so easy to read, and I want to include
the full address and account information so it's easier for the lucky
person who is going to clean the data. This is where the grouping
comes in:

<< Grouped Data >>
group_id, id
1, a
1, b
1, c
2, d
2, e
2, f
3, g
3, h

So in my situation those records that are very similar to each other
will be clustered together.
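
For reference, one way to turn the paired data into the grouped data
is a union-find (disjoint-set) structure. This is only a sketch and not
the code discussed earlier in the thread; the function name group_pairs
and the numbering of groups in sorted id order are assumptions. Because
each pair joins the two groups its ids already belong to, existing
groups do get merged, and ('a', 'b') is treated the same as ('b', 'a').

```python
def group_pairs(pairs):
    """Group ids that are linked, directly or indirectly, by a pair.

    pairs: iterable of (id1, id2) or (id1, id2, strength) tuples.
    Returns a list of (group_id, id) rows sorted by id.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            # Merge the two groups by linking one root under the other.
            parent[root_y] = root_x

    for id1, id2, *_ in pairs:
        union(id1, id2)

    group_ids = {}
    rows = []
    for item in sorted(parent):
        root = find(item)
        # Number groups in the order their first member appears.
        group_id = group_ids.setdefault(root, len(group_ids) + 1)
        rows.append((group_id, item))
    return rows


if __name__ == "__main__":
    paired = [
        ("a", "b", 0.8), ("a", "c", 0.88), ("b", "c", 0.8),
        ("d", "e", 0.75), ("d", "f", 0.88), ("e", "f", 0.87),
        ("g", "h", 0.77),
    ]
    for group_id, item in group_pairs(paired):
        print(group_id, item)
```

Run on the paired table above, this prints the same rows as the
grouped data. A chain such as ('a', 'b'), ('c', 'd'), ('b', 'c') ends
up as a single group rather than two.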

Thanks again.

ALJ