aggregation for a nested dict

Chris Rebert clp2 at rebertia.com
Thu Dec 2 15:07:35 EST 2010


On Thu, Dec 2, 2010 at 11:01 AM, chris <ozric at web.de> wrote:
> Hi,
>
> i would like to parse many thousand files and aggregate the counts for
> the field entries related to every id.
>
> extract_field greps the value for each field with a regex.
>
> result = [ { extract_field("id", line) : [extract_field("field1",
> line),extract_field("field2", line)]}  for line  in FILE ]
>
> result gives me:
> {'a': ['0', '84']},
> {'a': ['0', '84']},
> {'b': ['1000', '83']},
> {'b': ['0', '84']},
>
> I'd like to aggregate them per line (or maybe per file) so that,
> after the complete parsing run, I can count the number of ids
> having > 0 entries for '83'.
>
> {'a': {'0':2, '84':2}}
> {'b': {'1000':1,'83':1,'84':1} }

Er, what happened to the '0' for 'b'?

> My current solution with mysql is really slow.

Untested:

# requires Python 2.7+ due to Counter
from collections import defaultdict, Counter

FIELDS = ["field1", "field2"]

# Map each id to a Counter of its field values.
id2counter = defaultdict(Counter)
for line in FILE:
    identifier = extract_field("id", line)
    counter = id2counter[identifier]
    for field_name in FIELDS:
        field_val = int(extract_field(field_name, line))
        counter[field_val] += 1

print(id2counter)
# Count the ids having a nonzero count for field value 83.
print(sum(1 for counter in id2counter.values() if counter[83]))
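For what it's worth, here's a self-contained version of the same idea that
you can actually run. Since you didn't post extract_field, I've stubbed in a
hypothetical one that parses "key=value" pairs with a regex, just to
exercise the aggregation logic on data shaped like your example:

```python
import re
from collections import defaultdict, Counter

def extract_field(name, line):
    # Hypothetical stand-in for your extract_field: grabs the value
    # of a "name=value" pair from the line.
    return re.search(r"\b%s=(\w+)" % name, line).group(1)

FIELDS = ["field1", "field2"]
lines = [
    "id=a field1=0 field2=84",
    "id=a field1=0 field2=84",
    "id=b field1=1000 field2=83",
    "id=b field1=0 field2=84",
]

# Map each id to a Counter of its field values.
id2counter = defaultdict(Counter)
for line in lines:
    counter = id2counter[extract_field("id", line)]
    for field_name in FIELDS:
        counter[int(extract_field(field_name, line))] += 1

print(dict(id2counter))
# 'a' ends up with {0: 2, 84: 2}, 'b' with {1000: 1, 83: 1, 0: 1, 84: 1}

# Missing keys in a Counter return 0 rather than raising KeyError,
# so this safely counts ids that saw the value 83 at least once.
print(sum(1 for counter in id2counter.values() if counter[83]))
# → 1
```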

Cheers,
Chris
--
http://blog.rebertia.com
