selecting dictionaries to maximize counts

Fri Feb 18 18:18:19 EST 2005

John Machin wrote:
> Steven Bethard wrote:
> 
>> I have a list of dictionaries.  Each dictionary holds counts of
>> various 'words'...
>> ...
>> Basically, I use a greedy approach -- adding a dict each time if I
>> can.
>> ... 
>> This leads to some suboptimal solutions given that, while the total
>> counts must not exceed MAX_VALUE, I'd like them to be as close to
>> MAX_VALUE as possible.  An example of data on which my select
>> function produces such a suboptimal solution:
>>
>> py> countdicts = [
>> ...     dict(a=9, b=9, c=9),
>> ...     dict(a=8, b=7),
>> ...     dict(a=4, b=5, c=12)]
>> py> select(countdicts, 12)
>> [{'a': 9, 'c': 9, 'b': 9}]
>>
>> The better solution would have been to select the second two dicts --
>> then all 'words' would have count 12.
>>
>> I don't need an optimal solution; in fact, the solution I have right
>> now is tolerable.  But if anyone knows a simple way to improve my
>> performance, I'd appreciate it.  Thanks!
> 
> You should specify a function that assigns a score to valid solutions;
> it's not apparent from the wording above and the example exactly what
> that function might be. Thinking through the alternatives might help
> you solve the problem yourself :-)

Yeah, I guess that's a good idea.  Here's a scoring function that should 
give the right range of scores (where 1.0 is perfect and 0.0 is failure):

py> from __future__ import division
py> def score(origdicts, selecteddicts, n):
...     origwords = set(word for d in origdicts for word in d)
...     counts = {}
...     for d in selecteddicts:
...         for word, count in d.iteritems():
...             counts[word] = counts.get(word, 0) + count
...             if counts[word] > n:
...                 return 0.0
...     return sum(counts.itervalues())/len(origwords)/n
...
py> countdicts = [
...     dict(a=9, b=9, c=9),
...     dict(a=8, b=7),
...     dict(a=4, b=5, c=12)]
py> score(countdicts, countdicts, 12)
0.0
py> score(countdicts, select(countdicts, 12), 12)
0.75
py> score(countdicts, [dict(a=8, b=7), dict(a=4, b=5, c=12)], 12)
1.0

It's pretty easy to build a configuration that can't be solved with a 
perfect score:

py> def subsets(lst):
...     if not lst:
...         yield []
...     else:
...         first = lst[:1]
...         for subset in subsets(lst[1:]):
...             yield subset
...             yield first + subset
...
py> countdicts = [
...     dict(a=1, b=2, c=3),
...     dict(a=4, b=5, c=6),
...     dict(a=7, b=8, c=9)]
py> for dicts in subsets(countdicts):
...     print score(countdicts, dicts, 9), dicts
...
0.0 []
0.222222222222 [{'a': 1, 'c': 3, 'b': 2}]
0.555555555556 [{'a': 4, 'c': 6, 'b': 5}]
0.777777777778 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}]
0.888888888889 [{'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 
9, 'b': 8}]

Hmmm...  I can't do an exhaustive search like this one here -- I have 
~8000 dicts, so 2**N is definitely too large of a space to search.  Now 
that I have a scoring routine maybe I could do an A* search though...

Thanks for the input!

STeVe