Recursive generator for combinations of a multiset?

John O'Hagan research at johnohagan.com
Tue Nov 26 01:18:41 EST 2013


On Mon, 25 Nov 2013 12:15:15 +0000
Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:

> On 21 November 2013 13:01, John O'Hagan <research at johnohagan.com>
> wrote:
> > In my use-case the first argument to multicombs is a tuple of words
> > which may contain duplicates, and it produces all unique
> > combinations of a certain length of those words, eg:
> >
> > list(multicombs(('cat', 'hat', 'in', 'the', 'the'), 3))
> >
> > [('cat', 'hat', 'in'), ('cat', 'hat', 'the'), ('cat', 'in', 'the'),
> > ('cat', 'the', 'the'), ('hat', 'in', 'the'), ('hat', 'the', 'the'),
> > ('in', 'the', 'the')]
> 
> I still don't understand what you're actually doing well enough to
> know whether there is a better general approach to the problem. For
> the specific thing you requested, here is a recursive multiset
> combinations generator. Does it do what you wanted?
> 
> #!/usr/bin/env python
> 
> def multicombs(it, r):
>     words = []
>     last = None
>     for word in it:
>         if word == last:
>             words[-1][1] += 1
>         else:
>             words.append([word, 1])
>             last = word
>     cumulative = 0
>     for n in range(len(words)-1, -1, -1):
>         words[n].append(cumulative)
>         cumulative += words[n][1]
>     return _multicombs((), words, r)
> 
> def _multicombs(prepend, words, r):
>     if r == 0:
>         yield prepend
>         return
>     (word, count, rem), *remaining = words
>     for k in range(max(r-rem, 0), min(count, r) + 1):
>         yield from _multicombs(prepend + (word,) * k, remaining, r-k)

Thanks, that is exactly the type of thing I was after, although as-is it
is actually slower than a hack like

import itertools

def imulticombs(it, n):
    last = ()
    for i in itertools.combinations(it, n):
        if i > last:
            yield i
            last = i

it can be modified to be a lot faster for my use-case like this:

def _multicombs(prepend, words, r, chkstr):
    """chkstr is the string of remaining available characters"""
    if r == 0:
        yield prepend, chkstr
        return
    (word, count, rem), *remaining = words
    newstr = chkstr
    for k in range(max(r-rem, 0), min(count, r) + 1):
        for letter in word * k:
            if letter in newstr:
                newstr = newstr.replace(letter, '', 1)
            else:
                return
        yield from _multicombs(prepend + (word,) * k, remaining, r-k,
                               newstr)

By progressively checking against the available characters, it avoids
fruitless recursion branches. I'm not 100% sure I'm doing this right yet,
but so far it's promising. This is the quickest non-redundant approach
I've used so far (although strangely, the original product-only version,
which produces a lot of redundant results, is still by far the quickest).

This is how I'm calling it:

def cartesian_product(args, instr):
    if not args:
        yield (), instr
        return
    for items, chkstr in cartesian_product(args[:-1], instr):
        # this prevents bothering with anything useless
        words, repeats = args[-1]
        if repeats == 1:
            for word in words:
                newstr = chkstr
                for letter in word:
                    if letter in newstr:
                        newstr = newstr.replace(letter, '', 1)
                    else:
                        break
                else:
                    yield items + (word,), newstr
        else:
            for item, newstr in multicombs(words, repeats, chkstr):
                yield items + item, newstr

I'll try to explain better what I'm actually doing. It's quite long,
but you seem to be wondering, or maybe someone else is!

It's an old anagram program I wrote for practice a few years ago, which
produces the phrases of dictionary words which together contain the
same letter-multiset as the input phrase. I would be interested to know
if you think there is a better general approach. My approach is:

- make a dictionary of candidate words composed of the same
  sub-alphabet as the input, using their lengths as keys, then

- for each integer partition of the length of the input, replace each
  partition element with the corresponding value from the dictionary,
  then

- take the cartesian product of the resulting list of lists of words,
  rejecting any output which exceeds any of the letter frequencies of
  the input
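A minimal end-to-end sketch of those three steps, using itertools.product and filtering afterwards (this ignores the on-the-fly pruning and the duplicate-output problem discussed below; anagram_phrases and partitions are illustrative names, and the tiny word list stands in for a real dictionary):

```python
from collections import Counter
from itertools import product

def partitions(n, maximum=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    maximum = n if maximum is None else maximum
    if n == 0:
        yield ()
        return
    for first in range(min(n, maximum), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def anagram_phrases(phrase, wordlist):
    letters = Counter(phrase.replace(' ', ''))
    # Step 1: bucket candidate sub-alphabet words by length.
    by_length = {}
    for word in wordlist:
        if not (Counter(word) - letters):   # word's letters fit the input
            by_length.setdefault(len(word), []).append(word)
    # Step 2: for each partition of the input length, substitute word lists.
    for part in partitions(sum(letters.values())):
        if all(k in by_length for k in part):
            # Step 3: take the product, rejecting over-budget phrases.
            for words in product(*(by_length[k] for k in part)):
                if Counter(''.join(words)) == letters:
                    yield words
```

Note that, exactly as described below, this filter-at-the-end version yields the same multiset of words in several orders, e.g. both ('rat', 'a', 'k') and ('rat', 'k', 'a') for the input 'ark at'.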

I have omitted describing some complex optimisations of the lists, which
gave only fractional speedups.

However, I recently discovered that I could get an orders-of-magnitude
speedup by using my own recursive cartesian product algorithm which
checks letter frequencies on the fly, rather than using
itertools.product and testing each output phrase. (This algorithm was
in my original post.)

For example, take the phrase "break beat". I make a wordlength
dictionary of sub-alphabet words, using the system dictionary file,
like this:

 {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
 'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb', 'eke',
 'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4: ['abet',
 'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate', 'beak',
 'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake', 'rate',
 'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5: ['abate',
 'baker', 'beret', 'brake', 'break', 'eater', 'karat', 'kebab',
 'taker'], 6: ['aerate', 'beaker', 'beater', 'berate', 'betake',
 'karate', 'rebate', 'retake']}
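That dictionary can be built with collections.Counter; a minimal sketch, with a toy word list standing in for the system dictionary file (wordlength_dict is an illustrative name):

```python
from collections import Counter

def wordlength_dict(phrase, wordlist):
    letters = Counter(phrase.replace(' ', ''))
    buckets = {}
    for word in wordlist:
        # Keep only words whose letter multiset fits inside the input's.
        if Counter(word) | letters == letters:
            buckets.setdefault(len(word), []).append(word)
    return buckets

print(wordlength_dict('break beat', ['at', 'be', 'tar', 'brake', 'zoo']))
# {2: ['at', 'be'], 3: ['tar'], 5: ['brake']}
```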


The input has nine letters, and the partitions of
nine are:

9
1, 8
2, 7
1, 1, 7
3, 6
... etc
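Partitions in that order can be generated recursively; this is one sketch (partitions is an illustrative name, not from the original program):

```python
def partitions(n, m=None):
    """Yield the integer partitions of n with parts ascending, trying
    larger largest-parts first -- matching the listing above."""
    m = n if m is None else m
    if n == 0:
        yield ()
        return
    for part in range(min(n, m), 0, -1):
        for rest in partitions(n - part, part):
            yield rest + (part,)

for p in list(partitions(9))[:5]:
    print(p)
# (9,)
# (1, 8)
# (2, 7)
# (1, 1, 7)
# (3, 6)
```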

and for each partition, I grab the relevant lists from the dictionary,
eg. the first one for which dictionary entries are available is (3, 6)
which gives:

 [(['are', 'ark', 'art', 'ate', 'baa', 'bar', 'bat', 'bee', 'bet',
'bra', 'ear', 'eat', 'ebb', 'eke', 'era', 'ere', 'eta', 'rat', 'tab',
'tar', 'tea', 'tee'], 1), (['aerate', 'beaker', 'beater', 'berate',
'betake', 'karate', 'rebate', 'retake'], 1)]

The '1' in each tuple above is the number of repeats of the partition
element.
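Grouping one partition into those (word_list, repeats) pairs can be sketched like this (partition_args is an illustrative name; Counter counts the repeated partition elements):

```python
from collections import Counter

def partition_args(partition, by_length):
    """Return [(word_list, repeats), ...] for one partition, or None
    if some part has no candidate words of that length."""
    args = []
    for length, repeats in sorted(Counter(partition).items()):
        if length not in by_length:
            return None
        args.append((by_length[length], repeats))
    return args

by_length = {3: ['are', 'ark'], 6: ['aerate', 'beaker']}
print(partition_args((3, 6), by_length))
# [(['are', 'ark'], 1), (['aerate', 'beaker'], 1)]
print(partition_args((3, 3, 3), by_length))
# [(['are', 'ark'], 3)]
```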

Then I run cartesian_product on the lists, and so on for all the
partitions.  The full result is like:

bar betake 
bat beaker 
betake bra 
[... many more ...]
bra eke tab 
a ark be bet 
ark at be be 


The caveat, and subject of my original post, is that if a partition
element is repeated, we run into complications. In this example, for the
partition (3, 3, 3), if we take the product of three copies of this
list:

['are', 'ark', 'art', 'ate', 'baa', 'bar', 'bat', 'bee', 'bet', 'bra',
'ear', 'eat', 'ebb', 'eke', 'era', 'ere', 'eta', 'rat', 'tab', 'tar',
'tea', 'tee']

we get a lot of repetition:

['ark', 'ate', 'ebb']
['ark', 'ate', 'ebb']
['ark', 'ate', 'ebb']
['ark', 'ate', 'ebb']
['ark', 'ate', 'ebb']
['ark', 'ate', 'ebb']
[... many more ...]
['bra', 'eke', 'tab']
['bra', 'eke', 'tab']
['bra', 'eke', 'tab']
['bra', 'eke', 'tab']
['bra', 'eke', 'tab']
['bra', 'eke', 'tab']

I sorted this output to demonstrate the problem. For longer phrases,
the bulk of the output is redundant.

To fix this we could use itertools.combinations(list,
number_of_repeats), but this creates another issue: if there are enough
letters in the input for particular output words to be repeated,
itertools.combinations will miss the repetitions. In this example, the
word 'be' can occur twice in an output phrase. To allow for this
repetition using combinations, we have to insert another copy of it
into the list of two-letter words. But now we have redundancy again, as
every combination using 'be' will also be repeated!
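The dilemma can be seen directly with the two-letter candidates ['at', 'be', 're']:

```python
from itertools import combinations

words = ['at', 'be', 're']
print(list(combinations(words, 2)))
# [('at', 'be'), ('at', 're'), ('be', 're')] -- no ('be', 'be')

# Duplicating 'be' restores ('be', 'be') but reintroduces redundancy:
print(list(combinations(words + ['be'], 2)))
# [('at', 'be'), ('at', 're'), ('at', 'be'), ('be', 're'), ('be', 'be'),
#  ('re', 'be')] -- ('at', 'be') twice, and the 'be'/'re' pair both ways
```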

The solution is to use multiset combinations, but doing so to date has
cost much of the big speedup I got by using a self-pruning recursive
version of cartesian product. I'm hoping that some more tweaking of an
algorithm like yours will allow the same great speed improvement I got
for the cartesian product version. 

For now, although to a lesser extent, I have to choose between
redundancy and slowness!

Regards,

--

John


