flattening/rolling up/aggregating a large sorted text file

Steve Holden steve at holdenweb.com
Wed Mar 21 08:49:39 EDT 2007


adtvff at yahoo.com wrote:
> Hi,
> 
> Given a large ascii file (delimited or fixed width) with one ID field
> and dimensions/measures fields, sorted by dimensions, I'd like to
> "flatten" or "rollup" the file by creating new columns: one for each
> combination of dimension level, and summing up measures over all
> records for a given ID.
> 
> If the wheel has already been invented, great, please point me in the
> right direction. If not, please share some pointers on how to think
> about this problem in order to write efficient code.
> 
> Is a hash with dimension level combinations a good approach, with
> values reset at each new ID level?
> 
> I know MySQL, Oracle, etc. will do this, but they all have a cap on
> the number of columns allowed. SAS will allow unlimited columns, but
> I don't own SAS.
> 
> Thanks.
> 
> 
> ID,color,shape,msr1
> ------------------------------
> 001, blue, square, 4
> 001, red,  circle, 5
> 001, red,  circle, 6
> 
> 
> ID, blue_circle, blue_square, red_circle, red_square
> --------------------------------------------------------------------------
> 001,0,4,11,0
> 002 ...
> 
It seems a bit wrong-headed to force this problem into a solution where 
you define relations with a variable number of columns. The natural way 
to solve it would be to sum the msr1 values for each unique combination 
of ID, color and shape; that's a pretty straightforward relational 
problem.

So, is there some reason the result *has* to have that variable number 
of columns?
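[That grouped sum is a one-dict job in Python. A minimal sketch, with
the sample records hard-coded in place of reading the file:]

```python
# One output row per unique (ID, color, shape): no pivoting and no
# variable column list needed.
records = [
    ("001", "blue", "square", 4),
    ("001", "red", "circle", 5),
    ("001", "red", "circle", 6),
]

totals = {}
for rec_id, color, shape, msr1 in records:
    key = (rec_id, color, shape)
    totals[key] = totals.get(key, 0) + msr1

for key in sorted(totals):
    print(key, totals[key])
```

[This is exactly what SQL's GROUP BY would produce, and it sidesteps
the column-count cap entirely.]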

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Skype: holdenweb     http://del.icio.us/steve.holden
Recent Ramblings       http://holdenweb.blogspot.com

More information about the Python-list mailing list