Newbie with sort text file question

Sat Jul 12 17:06:54 EDT 2003

At 12:46 PM 7/12/2003 -0700, stuartc wrote:

>Hi:
>
>I'm not a total newbie, but I'm pretty green.  I need to sort a text
>file and then get a total for the number of occurrences for a part of
>the string. Hopefully, this will explain it better:
>
>Here's the text file:
>
>banana_c \\yellow
>apple_a \\green
>orange_b \\yellow
>banana_d \\green
>orange_a \\orange
>apple_w \\yellow
>banana_e \\green
>orange_x \\yellow
>orange_y \\orange
>
>I would like two output files:
>
>1) Sorted like this, by the fruit name (the name before the underscore)
>
>apple_a \\green
>apple_w \\yellow
>banana_c \\yellow
>banana_d \\green
>banana_e \\green
>orange_a \\orange
>orange_b \\yellow
>orange_x \\yellow
>orange_y \\orange
>
>2) Then summarized like this, ordered with the highest occurrences
>first:
>
>orange occurs 4
>banana occurs 3
>apple occurs 2
>
>Total occurrences is 9

I am developing a Python version of IBM's CMS Pipelines, which is designed 
for this kind of task. If you'd like to be an early recipient (read beta 
tester) of this product, let me know.

You would invoke this program:
Pipe(r"""
   < c:\input.txt
   | split /_/
   | nlocate -\\-
   | pad 10
   | sort count
   | spec 11-* 1 / occurs / 11 1-10 19
   | > c:\output1.txt
   | count
   | spec /Total occurrences is / 1 1-* 21
   | > c:\output2.txt""")

Explanation:
|               == separates each stage of the pipe
<               == read records from file
split           == split each record into 2 records at first _
nlocate         == select records that do not contain \\
pad 10          == ensure each record has 10 characters (or whatever the
                   longest fruit name is)
sort count      == sort; group by unique key and prepend count
spec ...        == select cols 11-end of input, append literal, append
                   cols 1-10
>               == write records to file
count           == count the records
spec ...        == start with literal, append rest of record
>               == write records to file
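
To make the stage descriptions concrete, here is a rough plain-Python imitation of the first few stages. It is only a sketch and not how Pipe itself is implemented: the function names (split_at, nlocate, pad, sort_count) are made up for this example, the count is assumed to be a 10-column right-aligned field, and the final formatting is tidied rather than placed column by column the way spec would do it.

```python
from collections import Counter

def split_at(records, sep="_"):
    """Emit two records per input record, broken at the first separator."""
    for record in records:
        head, found, tail = record.partition(sep)
        yield head
        if found:
            yield tail

def nlocate(records, text):
    """Keep only the records that do not contain text."""
    return (r for r in records if text not in r)

def pad(records, width):
    """Pad each record on the right with blanks to at least width columns."""
    return (r.ljust(width) for r in records)

def sort_count(records):
    """Sort, collapse duplicate records, and prepend a 10-column count."""
    counts = Counter(records)
    return ["%10d%s" % (counts[key], key) for key in sorted(counts)]

# The sample data from the question.
lines = [
    r"banana_c \\yellow",
    r"apple_a \\green",
    r"orange_b \\yellow",
    r"banana_d \\green",
    r"orange_a \\orange",
    r"apple_w \\yellow",
    r"banana_e \\green",
    r"orange_x \\yellow",
    r"orange_y \\orange",
]

# Chain the stages in the same order as the pipe specification.
records = sort_count(pad(nlocate(split_at(lines), r"\\"), 10))

# Roughly what the first spec stage does: name, literal, then the count.
report = ["%s occurs %s" % (r[10:].rstrip(), r[:10].strip()) for r in records]
for row in report:
    print(row)
```

Each function takes records in and hands records on, which is the idea behind chaining the stages with |.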

Or it can be run as a DOS Command:
C>python pipe.py spec.txt
where spec.txt contains the pipe specification

An enhancement to the IBM Pipeline specification for SPLIT will be to route 
the 2nd part of each record to the secondary output, effectively discarding 
it in this example, and eliminating the need for the NLOCATE stage.
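
As an illustration of that idea only (the name split_with_secondary is hypothetical, and two lists stand in for the primary and secondary output streams), the routing could look something like this in plain Python:

```python
def split_with_secondary(records, sep="_"):
    """Send the part before the first separator to the primary output
    and the remainder, if any, to the secondary output."""
    primary = []
    secondary = []
    for record in records:
        head, found, tail = record.partition(sep)
        primary.append(head)
        if found:
            secondary.append(tail)
    return primary, secondary

primary, secondary = split_with_secondary(
    [r"apple_a \\green", r"orange_b \\yellow"]
)
print(primary)    # the fruit names, which are all the later stages need
print(secondary)  # the leftovers, which this example would simply ignore
```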

This particular task can also be done fairly easily in Python. The appeal 
of Pipe is that you focus on the specification rather than writing Python 
code that is specific to the task. This shortens development time, and 
enhances readability and maintainability.

The Python version:

infile = open(r'c:\input.txt')
fruits = {} # a dictionary to hold each fruit and its count
lines = infile.readlines()
infile.close()
for line in lines:
   fruit = line.split('_', 1)[0]
   if fruit in fruits:
     fruits[fruit] += 1 # increment count
   else:
     fruits[fruit] = 1 # add to dictionary with count of 1
output1 = open(r'c:\output1.txt', 'w')
# write the fruits with the highest count first
for key, value in sorted(fruits.items(), key=lambda item: item[1], reverse=True):
   output1.write("%s occurs %s\n" % (key, value))
output1.close()
output2 = open(r'c:\output2.txt', 'w')
output2.write("Total occurrences is %s\n" % len(lines))
output2.close()
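
The code above does not write the first file the question asks for (the lines sorted by fruit name). A minimal sketch of that step follows. The names sort_by_fruit and write_sorted are invented for this example, and the paths passed to write_sorted would be whatever files you actually use.

```python
def sort_by_fruit(lines):
    """Order lines by the text before the first underscore, then by the rest."""
    return sorted(lines, key=lambda line: line.split("_", 1))

def write_sorted(in_path, out_path):
    """Read in_path, sort its lines by fruit name, and write them to out_path."""
    with open(in_path) as infile:
        lines = [line.rstrip("\n") for line in infile if line.strip()]
    with open(out_path, "w") as outfile:
        for line in sort_by_fruit(lines):
            outfile.write(line + "\n")

# Quick check on the sample data from the question.
sample = [
    r"banana_c \\yellow",
    r"apple_a \\green",
    r"orange_b \\yellow",
    r"banana_d \\green",
    r"orange_a \\orange",
    r"apple_w \\yellow",
    r"banana_e \\green",
    r"orange_x \\yellow",
    r"orange_y \\orange",
]
sorted_sample = sort_by_fruit(sample)
for line in sorted_sample:
    print(line)
```

Splitting each line at the first underscore gives a [fruit, rest] pair, so lines with the same fruit stay together and are ordered by their suffix.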

Bob Gailer
bgailer at alum.rpi.edu
303 442 2625