Recursive generator for combinations of a multiset?

John O'Hagan research at johnohagan.com
Thu Nov 21 01:46:14 EST 2013


Short story: the subject says it all, so if you have an answer already,
fire away. Below is the long story of what I'm using it for, and why I
think it needs to be recursive. It may even be of more general
interest in terms of filtering the results of generators. 

I'm playing with an anagram-generating module which works like this:

1) Generate the integer partitions of the length of the input string.

2) For each partition, replace each of its elements with a list of all
dictionary words which are a) the same length as that element and
b) a sub-multiset of the input string.

eg: "cat in hat" -> ... , [3,5], ... -> [[act, ant, ...], [antic,
attic, ...]]

3) Pass each resulting list of wordlists to itertools.product, filtering
out of the output anything whose combined letters are not the same
multiset of characters as the input string.
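
For concreteness, steps 1-3 might be sketched like this (the
`partitions` helper, the `anagrams` name and the toy dictionary are
illustrative, not the real module):

```python
from collections import Counter
from itertools import product

def partitions(n, largest=None):
    # Integer partitions of n as non-increasing tuples.
    if n == 0:
        yield ()
        return
    if largest is None:
        largest = n
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield (part,) + rest

def anagrams(phrase, dictionary):
    letters = Counter(phrase.replace(' ', ''))
    n = sum(letters.values())
    for sizes in partitions(n):
        # Step 2: one wordlist per partition element; a word qualifies
        # if it has the right length and is a sub-multiset of the input.
        lists = [[w for w in dictionary
                  if len(w) == size and not Counter(w) - letters]
                 for size in sizes]
        # Step 3: filter the product down to exact anagrams.
        for combo in product(*lists):
            if Counter(''.join(combo)) == letters:
                yield combo
```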

This works but gets very slow for long strings. It spends most of its
time at the filtering stage because most of the results have too many
of some characters (and therefore not enough of others). I do some
optimising of the lists prior to running product, but this only shaves
off smallish percentages.

I got a big speedup (factor of five for medium-length inputs, much more
for longer strings) by replacing itertools.product with this recursive
generator:

def cartesian_product(args, input_str):
    if not args:
        yield (), input_str
        return
    for words, check_str in cartesian_product(args[:-1], input_str):
        for word in args[-1]:
            # Prune: only keep words that fit in the letters still unused.
            new_str = check_str
            for letter in word:
                if letter in new_str:
                    new_str = new_str.replace(letter, '', 1)
                else:
                    break  # word needs a letter we've run out of
            else:
                yield words + (word,), new_str
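
Driven with toy data it looks like this (the definition is repeated so
the snippet runs standalone; the wordlists are hypothetical):

```python
def cartesian_product(args, input_str):
    # As defined above: product of wordlists, pruned against input_str.
    if not args:
        yield (), input_str
        return
    for words, check_str in cartesian_product(args[:-1], input_str):
        for word in args[-1]:
            new_str = check_str
            for letter in word:
                if letter in new_str:
                    new_str = new_str.replace(letter, '', 1)
                else:
                    break  # word needs a letter we've run out of
            else:
                yield words + (word,), new_str

# Two wordlists for a [3, 5] partition of 'catinhat' (spaces removed);
# only letter-compatible tuples survive, with no post-filtering needed.
lists = [['act', 'hat', 'tin'], ['antic', 'attic', 'chant']]
for words, leftover in cartesian_product(lists, 'catinhat'):
    print(words, repr(leftover))
# prints: ('hat', 'antic') ''
```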

Despite having far more per-item overhead than itertools.product, it can
"prune" branches of the recursion as soon as they accumulate too many of
any character. This makes it much faster overall, and it produces
correct results without further filtering.

But there is another problem. The initial partitions contain repeated
elements, so the corresponding wordlists are also repeated. This means
that any cartesian product will contain many redundant results - the
same words in a different order. For a medium-sized string, this is
most of them. 
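
A concrete illustration of the redundancy, assuming a partition with a
repeated element:

```python
from itertools import product

# A partition like [3, 3] produces the same wordlist twice, so the
# product contains every unordered pair twice, once per ordering.
wordlist = ['act', 'tan']
pairs = list(product(wordlist, wordlist))
# ('act', 'tan') and ('tan', 'act') are the same words in a different order
assert ('act', 'tan') in pairs and ('tan', 'act') in pairs
```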

A solution for repeated sections of a partition of wordlists is to
use r-combinations (where r is the number of repeats). In this
scenario, though, some words may be usable more than once, and the
right number of copies of these words must be added to the list to
allow this. This means I need the combinations of a multiset (so I
can't use itertools.combinations).
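
itertools.combinations picks by position, so when the list itself
contains repeats it yields duplicate tuples (example words are
illustrative):

```python
from itertools import combinations

# 'act' appears twice because the input letters allow it to be used twice.
words = ['act', 'act', 'tan']
print(list(combinations(words, 2)))
# prints: [('act', 'act'), ('act', 'tan'), ('act', 'tan')]
# A multiset-combinations generator should instead yield each of
# ('act', 'act') and ('act', 'tan') exactly once.
```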

I found a verbal description of such an algorithm and came up with
this:

def multicombs(it, r):
    # `it` must be sorted, with equal items adjacent.
    result = it[:r]
    yield result
    while 1:
        # Scan from the right for the first position that can still
        # be advanced past its current value.
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break  # result == it[-r:], the last combination
        # Find the smallest item greater than the one being replaced.
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result
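
On the duplicate-wordlist example it behaves as wanted (definition
repeated so the snippet runs standalone):

```python
def multicombs(it, r):
    # As defined above; `it` must be sorted, equal items adjacent.
    result = it[:r]
    yield result
    while 1:
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

words = ['act', 'act', 'tan']  # sorted, duplicates adjacent
print(list(multicombs(words, 2)))
# prints: [['act', 'act'], ['act', 'tan']] -- each multiset exactly once
```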

I added a call to this generator in a branch of the main generator
above to deal with repeated elements. This eliminates the redundant
results, but at the cost of a substantial slowdown: the program now
spends most of its time inside multicombs.

I'm hoping that if I could find a recursive multiset combination
generator, I could speed it up using the same pruning approach.
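
For what it's worth, one recursive shape such a generator might take (a
sketch only, untested on the real workload, and without the string-
pruning hook wired in):

```python
def rec_multicombs(items, r):
    # items: sorted sequence with equal elements adjacent.
    if r == 0:
        yield ()
        return
    if len(items) < r:
        return
    first = items[0]
    # Count how many copies of the first element are available.
    k = 1
    while k < len(items) and items[k] == first:
        k += 1
    # Take min(k, r)..0 copies of `first`, then recurse on the rest;
    # each distinct multiset comes out exactly once.
    for take in range(min(k, r), -1, -1):
        for rest in rec_multicombs(items[k:], r - take):
            yield (first,) * take + rest

print(list(rec_multicombs(('act', 'act', 'tan'), 2)))
# prints: [('act', 'act'), ('act', 'tan')]
```

Because it recurses, a check_str could be threaded through it the same
way cartesian_product does, pruning a branch as soon as the chosen
copies no longer fit the remaining letters.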

Any suggestions?

--

John


