how to flatten & lump lists

Sun Dec 31 18:25:08 EST 2000

[cpsoct at my-deja.com]

/F covered the flatten part, so let's look at the other one:

> ...
> additionally i would like to do the opposite, take a list like
> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>
> and have it lump up into subgroups (randomly) and return:
> [[1],[2],[3, 4, 5], [6, 7], [8, 9, 10]]

Well, the problem seems ill-specified.  Try this for a start:

import random

def clump(seq):
    result = []
    i, n = 0, len(seq)
    # invariant:  result already contains a clumping of seq[:i]
    while i < n:    # while seq[i:] still needs to be clumped
        j = random.randint(i+1, n)  # assuming empty clumps no good
        result.append(seq[i:j])
        i = j
    return result

Then, e.g.,

for i in range(5):
    print clump(range(1, 11))

may print

[[1], [2, 3, 4, 5, 6, 7, 8], [9], [10]]
[[1, 2, 3, 4, 5, 6, 7, 8, 9], [10]]
[[1, 2, 3, 4, 5, 6, 7], [8], [9, 10]]
[[1, 2, 3, 4, 5, 6, 7, 8], [9], [10]]
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]

As that suggests, you're much more likely to get "large" clumps at the start
than at the end with this algorithm.  That may or may not suit your needs --
you have to define "random" in a sense appropriate for your application.

Fair warning:  generating random instances of large structured objects
"fairly" can be extremely subtle.  In this case, if you want every possible
way of clumping to be equally likely, there's a pretty simple way:

def clump(seq):
    if not seq:
        return []
    result = [seq[:1]]
    for e in seq[1:]:
        # either extend last clump or start a new one, with
        # equal odds
        if random.random() < 0.5:
            # extend last clump
            result[-1].append(e)
        else:
            # start a new clump
            result.append([e])
    return result

Proof sketch:  let the total number of ways to clump a sequence of length N
be C(N).  C(1) == 1 by inspection.  Suppose N > 1, and consider the first
element E of the sequence.  Either E is in a clump by itself, or it isn't.
If it is by itself, the rest of the clumping is a clump of a list of N-1
elements, and there are C(N-1) ways to do that.  Else E is not by itself, so
there's something other than E in the first piece of the clump.  Removing E
from the first piece of the clump thus again yields a clumping of a list of
N-1 elements, and again there are C(N-1) ways to do that.  So E is an a
clump by itself in exactly half the cases.

BTW, this shows that C(N)==2**(N-1).  A more direct way to see that is to
picture the elements in a line:

   e0 e1 e2 e3

and draw vertical bars at the clump boundaries.  For example,

    [[e0], [e1, e2], [e3]]

corresponds to

    e0|e1 e2|e3

If there are N elements, there are N-1 positions at which we *could* draw a
vertical bar.  At each position, we can either "draw a bar or not", so there
are 2**(N-1) total ways of choosing which bars to draw.  That's very
directly what the "if" test in the algorithm is doing:  choosing to draw a
bar (start a new clump) or not (extend the last clump) with equal odds.

good-programming-reduces-to-drawing-lines<wink>-ly y'rs  - tim