flattening/rolling up/aggregating a large sorted text file

Shane Geiger sgeiger at ncee.net
Wed Mar 21 14:45:21 EDT 2007


Apparently you want to use this data to know how many blue circles, blue 
squares, red circles and red squares.  In other words, I doubt you want 
to output redundant data columns, you just want this data in a more 
usable format and that you don't actually need to do multiple passes 
over it.

This is a fun problem to solve because it uses two very powerful tools: 
cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID.  I assumes 
the ID is the unique key.  So I changed the data slightly.





adtvff at yahoo.com wrote:
> Hi,
>
> Given a large ascii file (delimited or fixed width) with one ID field
> and dimensions/measures fields, sorted by dimensions, I'd like to
> "flatten" or "rollup" the file by creating new columns: one for each
> combination of dimension level, and summing up measures over all
> records for a given ID.
>
> If the wheel has already been invented, great, please point me in the
> right direction. If not, please share some pointers on how to think
> about this problem in order to write efficient code.
>
> Is a hash with dimension level combinations a good approach, with
> values reset at each new ID level?
>
> I know mysql, Oracle etc will do this , but they all have a cap on #
> of columns allowed. SAS will allow unlimited columns, but I don't own
> SAS.
>
> Thanks.
>
>
> ID,color,shape,msr1
> ------------------------------
> 001, blue, square,  4
> 001, red , circle,    5
> 001, red,  circle,    6
>
>
> ID, blue_circle, blue_square, red_circle, red_square
> --------------------------------------------------------------------------
> 001,0,4,11,0
> 002 ...
>
>   

-- 
Shane Geiger
IT Director
National Council on Economic Education
sgeiger at ncee.net  |  402-438-8958  |  http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: data_categories.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20070321/8f14ee8f/attachment.ksh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sgeiger.vcf
Type: text/x-vcard
Size: 310 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20070321/8f14ee8f/attachment.vcf>


More information about the Python-list mailing list