Grouping pairs - suggested tools

Tue Sep 21 02:19:48 EDT 2010

* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:
> * Astley Le Jasper, on 20.09.2010 23:42:
>> I have a list of tuples that indicate a relationship, ie a is related
>> to b, b is related to c etc etc. What I want to do is cluster these
>> relationships into groups. An item will only be associated with a
>> single cluster.
>>
>> Before I started, I wondered if there was any particular tool within
>> Python I should be looking at. I don't expect anyone to code this for
>> me, just say ... "you need to look at using x". I was going to use
>> populate a dictionary and
>>
>> Sorry for being so vague.
>>
>> Example Data:
>> [(a,b)
>> (a,c)
>> (a,d)
>> (b,c)
>> (b,d)
>> (c,d)
>> (e,f)
>> (e,g)
>> (f,g)	
>> (h,i)]
>>
>> Output (grouping id, item ref)
>> (1,a),
>> (1,b),
>> (1,c),
>> (1,d),
>> (2,e),
>> (2,f),
>> (2,g),
>> (3,h),
>> (3,i)
>
> It seems to be the same problem as "equivalence sets".
>
> This problem was solved early on because e.g. Fortran compilers had to construct
> such sets (equivalence partitions of a set).
>
> I though I'd just say "Google it!", because I know there's a standard algorithm
> but I can't recall the name. However, googling it I found no mention of that
> algorithm. Not even in the Wikipedia articles on equivalence sets.
>
> A number of approaches spring to mind:
>
> Approach 1:
> Multi-pass. Originally you assign a distinct set number to each symbol.
> In each pass through the symbols you replace one number with another as
> per one of the equivalence specs. Stop when no replacements are done.
>
> Approach 2:
> Merging. For each new equivalence A=B you check whether A has been assigned
> to a set, if not you create a new one, call that S1, and ditto for B, S2.
> Merge S1 and S2, e.g. move all elements of S2 to S1.
>
> Approach 3:
> In a first pass convert the data to more explicit form, linking each symbol
> to the symbols it's directly equivalent to. Then in second pass simply drag
> out each equivalence group (recurse on each symbol's list of equivalences).
>
> Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
> but I think approach 1 may be a practical option for a small number of symbols.

Uhm, thinking about it (it must have been my unconscious mind doing the work, it 
just popped into my head), if you first sort each individual equivalence 
relation so that you never have e.g. C=A but only A=C and so on, and then sort 
the list of equivalences, then it should reduce to walking the list and just 
starting a new set whenever a symbol is encountered that isn't yet in a set.

<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
     setattr( sym, name, name )

eq_specs = [
     (sym.a, sym.d),
     (sym.b, sym.a),
     (sym.b, sym.c),
     (sym.b, sym.d),
     (sym.c, sym.d),
     (sym.c, sym.a),
     (sym.e, sym.f),
     (sym.e, sym.g),
     (sym.f, sym.g),
     (sym.h, sym.i),
     ]

equalities = []
for eq in eq_specs:
     sorted_eq = eq if eq[0] <= eq[1] else (eq[1], eq[0])
     equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:
     if eq_set_ids.get( eq[0] ) is None:
         current_eq_set_id += 1
         eq_set_ids[eq[0]] = current_eq_set_id
         eq_sets[current_eq_set_id] = set( eq[0] )
     eq_set_id = eq_set_ids[eq[0]]
     eq_set_ids[eq[1]] = eq_set_id
     eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
     for sym_name in eq_sets[eq_set_id]:
         print( "{}, {}".format( eq_set_id, sym_name ) )
</code>

<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>

Disclaimer: for me it's pretty late in day/night.

Cheers & hth.,

- Alf

-- 
blog at <url: http://alfps.wordpress.com>