selecting dictionaries to maximize counts
Steven Bethard
steven.bethard at gmail.com
Fri Feb 18 18:18:19 EST 2005
John Machin wrote:
> Steven Bethard wrote:
>
>> I have a list of dictionaries. Each dictionary holds counts of
>> various 'words'...
>> ...
>> Basically, I use a greedy approach -- adding a dict each time if I
>> can.
>> ...
>> This leads to some suboptimal solutions given that, while the total
>> counts must not exceed MAX_VALUE, I'd like them to be as close to
>> MAX_VALUE as possible. An example of data on which my select
>> function produces such a suboptimal solution:
>>
>> py> countdicts = [
>> ... dict(a=9, b=9, c=9),
>> ... dict(a=8, b=7),
>> ... dict(a=4, b=5, c=12)]
>> py> select(countdicts, 12)
>> [{'a': 9, 'c': 9, 'b': 9}]
>>
>> The better solution would have been to select the second two dicts --
>> then all 'words' would have count 12.
>>
>> I don't need an optimal solution; in fact, the solution I have right
>> now is tolerable. But if anyone knows a simple way to improve my
>> performance, I'd appreciate it. Thanks!
>
> You should specify a function that assigns a score to valid solutions;
> it's not apparent from the wording above and the example exactly what
> that function might be. Thinking through the alternatives might help
> you solve the problem yourself :-)
Yeah, I guess that's a good idea. Here's a scoring function that should
give the right range of scores (where 1.0 is perfect and 0.0 is failure):
py> from __future__ import division
py> def score(origdicts, selecteddicts, n):
... origwords = set(word for d in origdicts for word in d)
... counts = {}
... for d in selecteddicts:
... for word, count in d.iteritems():
... counts[word] = counts.get(word, 0) + count
... if counts[word] > n:
... return 0.0
... return sum(counts.itervalues())/len(origwords)/n
...
py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]
py> score(countdicts, countdicts, 12)
0.0
py> score(countdicts, select(countdicts, 12), 12)
0.75
py> score(countdicts, [dict(a=8, b=7), dict(a=4, b=5, c=12)], 12)
1.0
It's pretty easy to build a configuration that can't be solved with a
perfect score:
py> def subsets(lst):
... if not lst:
... yield []
... else:
... first = lst[:1]
... for subset in subsets(lst[1:]):
... yield subset
... yield first + subset
...
py> countdicts = [
... dict(a=1, b=2, c=3),
... dict(a=4, b=5, c=6),
... dict(a=7, b=8, c=9)]
py> for dicts in subsets(countdicts):
... print score(countdicts, dicts, 9), dicts
...
0.0 []
0.222222222222 [{'a': 1, 'c': 3, 'b': 2}]
0.555555555556 [{'a': 4, 'c': 6, 'b': 5}]
0.777777777778 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}]
0.888888888889 [{'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c':
9, 'b': 8}]
Hmmm... I can't do an exhaustive search like this one here -- I have
~8000 dicts, so 2**N is definitely too large of a space to search. Now
that I have a scoring routine maybe I could do an A* search though...
Thanks for the input!
STeVe
More information about the Python-list
mailing list