[SciPy-user] Splitting up and Joining Sets

Wed May 28 20:51:25 EDT 2008

2008/5/27 Lorenzo Isella <lorenzo.isella at gmail.com>:

> I do not know for sure whether a SciPy array is the right tool for
> what I have in mind, but I would like to post this here anyhow.
> Say that you have a set of elements (plain integers):
> {1,2,3,4,5,6,7,8,9,10}, which can end up split into several subsets
> [no element is repeated from a subset to another, I always have 10
> elements as a whole]. Ordering does not matter, so {3,4,5} is the same
> as {5,3,4}. and {{1,2},{7,3}} is the same as {{7,3},{2,1}}.
> Now, this is what I would like to do:
> (1) Given e.g. the subsets {1,4,10},{3,7,6},{5,8,2} {9} I would like
> to be able to find out which sets were merged/split in a new
> configuration, e.g. {1,4,10,9} ,{3,7,6},{5,8,2} (joining of two sets)
> or {1,4,10},{3,7,6},{5,8} ,{9}, {2} (splitting of two sets).
> (2) There then may be a more subtle case, in which the total number of
> sets stays the same, but the structure of the sets changes e.g.:
> {1,4,10},{3,7,6},{5,8,2} {9} ---> {1,4,10},{3,7,6},{5,8} {9,2}, which
> I also would like to be able to detect, finding out which elements
> left which subset to end up in another one.
> (3) Even if a scipy array was not the most suitable structure to use
> for these manipulations, that is the way the data are originally
> manipulated in my code, so please let me know how I should convert
> them into another data type (if needed at all).

If you want efficient algorithms to work with these objects, they are
known as "partitions of a set" or "equivalence relations", and there
are various references. One of the volumes of Knuth's The Art of
Computer Programming has a section on equivalence relations, if I
recall correctly.

If you're going to go about it this way, you might consider
representing each partition as an array A of integers, in which A[i]
is the smallest integer in the set containing i. Then:

(1) for a crude O(N**2) algorithm:

def refinement(A,B): # untested!
    for s in np.unique(A): # for each set
        if len(np.unique(B[A==s]))>1: # if it came from more than one
            return False
    return True

Faster, more arcane algorithms are certainly possible:
def refinement(A,B): # untested!
    idx = np.argsort(A) # idx lets you easily get a list of elements
in a given subset
    return np.all(np.diff(B[idx])[np.diff(A[idx])==0]==0)

(2) can probably be accomplished similarly depending on exactly what
you want. Keep in mind that there may be no obvious way to associate
sets in the "before" and "after" partitions. You could describe how to
get from one to the other refining the first one, then joining some
sets of the refined partiton to get the second in a minimal way; this
seems to be called taking the "meet" of the two partitions. Here's a
crude algorithm:

def meet(A,B): # untested
    r = np.arange(len(A))
    C = B.copy()
    for s in np.unique(A):
        ss = A==s
        for t in np.unique(B):
            ts = B==t
            C[ss & ts] = r[ss & ts][0]

Given C the meet of A and B, a reasonable list of elements that have
"moved" is (C!=A) | (C!=B). Not ideal, sometimes it will say too many
elements moved.

Good luck,
Anne