Grouping pairs - suggested tools

Tue Sep 21 05:13:06 EDT 2010

On Sep 21, 7:19 am, "Alf P. Steinbach /Usenet" <alf.p.steinbach
+use... at gmail.com> wrote:
> * Alf P. Steinbach /Usenet, on 21.09.2010 01:09:
>
>
>
>
>
> > * Astley Le Jasper, on 20.09.2010 23:42:
> >> I have a list of tuples that indicate a relationship, ie a is related
> >> to b, b is related to c etc etc. What I want to do is cluster these
> >> relationships into groups. An item will only be associated with a
> >> single cluster.
>
> >> Before I started, I wondered if there was any particular tool within
> >> Python I should be looking at. I don't expect anyone to code this for
> >> me, just say ... "you need to look at using x". I was going to use
> >> populate a dictionary and
>
> >> Sorry for being so vague.
>
> >> Example Data:
> >> [(a,b)
> >> (a,c)
> >> (a,d)
> >> (b,c)
> >> (b,d)
> >> (c,d)
> >> (e,f)
> >> (e,g)
> >> (f,g)  
> >> (h,i)]
>
> >> Output (grouping id, item ref)
> >> (1,a),
> >> (1,b),
> >> (1,c),
> >> (1,d),
> >> (2,e),
> >> (2,f),
> >> (2,g),
> >> (3,h),
> >> (3,i)
>
> > It seems to be the same problem as "equivalence sets".
>
> > This problem was solved early on because e.g. Fortran compilers had to construct
> > such sets (equivalence partitions of a set).
>
> > I though I'd just say "Google it!", because I know there's a standard algorithm
> > but I can't recall the name. However, googling it I found no mention of that
> > algorithm. Not even in the Wikipedia articles on equivalence sets.
>
> > A number of approaches spring to mind:
>
> > Approach 1:
> > Multi-pass. Originally you assign a distinct set number to each symbol.
> > In each pass through the symbols you replace one number with another as
> > per one of the equivalence specs. Stop when no replacements are done.
>
> > Approach 2:
> > Merging. For each new equivalence A=B you check whether A has been assigned
> > to a set, if not you create a new one, call that S1, and ditto for B, S2.
> > Merge S1 and S2, e.g. move all elements of S2 to S1.
>
> > Approach 3:
> > In a first pass convert the data to more explicit form, linking each symbol
> > to the symbols it's directly equivalent to. Then in second pass simply drag
> > out each equivalence group (recurse on each symbol's list of equivalences).
>
> > Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
> > but I think approach 1 may be a practical option for a small number of symbols.
>
> Uhm, thinking about it (it must have been my unconscious mind doing the work, it
> just popped into my head), if you first sort each individual equivalence
> relation so that you never have e.g. C=A but only A=C and so on, and then sort
> the list of equivalences, then it should reduce to walking the list and just
> starting a new set whenever a symbol is encountered that isn't yet in a set.
>
> <code language="Py3">
> class Attributes: pass
>
> sym = Attributes()
> for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
>      setattr( sym, name, name )
>
> eq_specs = [
>      (sym.a, sym.d),
>      (sym.b, sym.a),
>      (sym.b, sym.c),
>      (sym.b, sym.d),
>      (sym.c, sym.d),
>      (sym.c, sym.a),
>      (sym.e, sym.f),
>      (sym.e, sym.g),
>      (sym.f, sym.g),
>      (sym.h, sym.i),
>      ]
>
> equalities = []
> for eq in eq_specs:
>      sorted_eq = eq if eq[0] <= eq[1] else (eq[1], eq[0])
>      equalities.append( sorted_eq )
> equalities.sort()
>
> eq_sets = {}
> eq_set_ids = {}
> current_eq_set_id = 0
> for eq in equalities:

This would make the body of the loop easier to read:

  for x, y in equalities:

>      if eq_set_ids.get( eq[0] ) is None:

Why not simply:

       if eq[0] in eq_set_ids:

>          current_eq_set_id += 1
>          eq_set_ids[eq[0]] = current_eq_set_id
>          eq_sets[current_eq_set_id] = set( eq[0] )
>      eq_set_id = eq_set_ids[eq[0]]
>      eq_set_ids[eq[1]] = eq_set_id
>      eq_sets[eq_set_id].add( eq[1] )
>
> for eq_set_id in eq_sets.keys():
>      for sym_name in eq_sets[eq_set_id]:
>          print( "{}, {}".format( eq_set_id, sym_name ) )
> </code>
>
> <output>
> 1, a
> 1, c
> 1, b
> 1, d
> 2, e
> 2, g
> 2, f
> 3, i
> 3, h
> </output>
>
> Disclaimer: for me it's pretty late in day/night.
>
> Cheers & hth.,
>
> - Alf
>
> --
> blog at <url:http://alfps.wordpress.com>

I think this won't work with the following graph:

eq_specs = [('a', 'c'), ('b', 'd'), ('c', 'd')]

Note that it is already sorted according to your sorting method.  I
don't have a Python machine to check this but I believe it will
output:

1, a
1, c
1, d
2, b

The flaw is that you fail to consider that the two vertices in the
current pair may already be part of a (partial) connected component.

--
Arnaud