flattening/rolling up/aggregating a large sorted text file
Eddie Corns
eddie at holyrood.ed.ac.uk
Thu Mar 22 12:10:13 EDT 2007
adtvff at yahoo.com writes:
>Hi,
>Given a large ascii file (delimited or fixed width) with one ID field
>and dimensions/measures fields, sorted by dimensions, I'd like to
>"flatten" or "rollup" the file by creating new columns: one for each
>combination of dimension level, and summing up measures over all
>records for a given ID.
>If the wheel has already been invented, great, please point me in the
>right direction. If not, please share some pointers on how to think
>about this problem in order to write efficient code.
>Is a hash with dimension level combinations a good approach, with
>values reset at each new ID level?
>I know mysql, Oracle etc will do this , but they all have a cap on #
>of columns allowed. SAS will allow unlimited columns, but I don't own
>SAS.
>Thanks.
>ID,color,shape,msr1
>------------------------------
>001, blue, square, 4
>001, red , circle, 5
>001, red, circle, 6
>ID, blue_circle, blue_square, red_circle, red_square
>--------------------------------------------------------------------------
>001,0,4,11,0
>002 ...
Something like:
import sys
from sets import Set
ids = {}
keys = Set()
for line in sys.stdin:
ID,COL,SHAPE,VAL = [s.strip() for s in line.split(',')]
ids.setdefault(ID,{})
key = '%s_%s'%(COL,SHAPE)
ids[ID].setdefault(key,0)
ids[ID][key] += int(VAL)
keys.add(key)
print 'id',',',','.join([str(key) for key in keys])
for id,cols in ids.items():
print id,',', ', '.join([str(cols.get(k,0)) for k in keys])
Doesn't keep all possible keys just those that are actually used.
Needs to sort() things here and there.
Incidentally I don't think you could do it in SQL at all in this way but you
could do it in a more vertical fashion (eg
001, red, circle, 11
001, blue, square, 4
002, red, rhombus, 99) etc.
Eddie
More information about the Python-list
mailing list