Recursive generator for combinations of a multiset?

Tue Nov 26 05:33:06 EST 2013

On 26 November 2013 06:18, John O'Hagan <research at johnohagan.com> wrote:
>
> Thanks, that is exactly the type of thing I was after. Although it is
> actually slower as is than a hack like
>
> def imulticombs(it, n):
>     last = ()
>     for i in itertools.combinations(it,n):
>         if i > last:
>             yield i
>             last = i
>
> it can be modified to be a lot faster for my use-case like this:
>
> def _multicombs(prepend, words, r, chkstr):
>     """chkstr is the string of remaining availalable characters"""
>     if r == 0:
>         yield prepend, chkstr
>         return
>     (word, count, rem), *remaining = words
>     newstr = chkstr
>     for k in range(max(r-rem, 0), min(count, r) + 1):
>         for letter in word * k:
>             if letter in newstr:
>                 newstr = newstr.replace(letter, '' , 1)
>             else:
>                 return
>         yield from _multicombs(prepend + (word,) * k, remaining, r-k,
>         newstr)

Further micro-optimisations are possible. One is to inline the lower
recursion levels. For example if the termination condition is checked
immediately before recursion you can eliminate a redundant generator
function call.

> By progressively checking against the available characters, it avoids
> fruitless recursion branches. I'm not 100% sure I'm doing this right yet
> but so far it's promising. This is the quickest non-redundant approach
> I've used so far (although strangely, the original product-only version
> which produces a lot of redundant results is still by far the quickest.)

Bear in mind that the interpreter does a lot of  "redundant" things so
what you might think of as non-redundant code can actually be highly
redundant when interpreter over-head is accounted for.

Recently I've been trying out PyPY in my own work and it is *really
fast*. In fact it is so much faster that I've given up on the idea of
speed-optimising code for CPython: just use PyPy instead - although it
is worth optimising highly intensive code for PyPy.

> I'll try to explain better what I'm actually doing. It's quite long,
> but you seem to be wondering, or maybe someone else is!

[snip]

Okay I understand now. I was confused because I think of anagrams as
being words of the same length. I didn't understand how your multiword
version works but I think I do now. In my own words: You want to
generate all sequences of English words having the same letters as
some input string, ignoring the spaces in both the input and output
strings (so that the number or length of the individual words need not
be the same).

[snip]
>
> For example, take the phrase "break beat". I make a wordlength
> dictionary of sub-alphabet words, using the system dictionary file,
> like this:
>
>  {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
>  'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb', 'eke',
>  'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4: ['abet',
>  'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate', 'beak',
>  'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake', 'rate',
>  'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5: ['abate',
>  'baker', 'beret', 'brake', 'break', 'eater', 'karat', 'kebab',
>  'taker'], 6: ['aerate', 'beaker', 'beater', 'berate', 'betake',
>  'karate', 'rebate', 'retake']}

Okay, how about the following (pseudoish-code)?

def word_sequences(prepend, target, subwords):
    # subwords is a list rather than a dict
    for n in range(1, len(subwords)):
        for word in subwords[n]:
            if equal_in_a_multiset_sense(word, target):
                yield prepend + ' ' + target
            elif is_a_submultiset(word, target):
                recurse_target = multiset_subtract(target, word)
                subsubwords = list_of_subwords_by_length[:len(recurse_target)]
                if any(subsubwords):
                    yield from word_sequences(prepend + ' ' + word,
recurse_target, subsubwords)

Your general approach seems reasonable to me now that I understand the
problem. The word_sequences generator above is the solution I would
probably write at first if faced with the problem. I suspect that your
current approach using partition sizes is better though.

Oscar