Perl and Python, a practical side-by-side example.

Bruno Desthuilliers bdesth.quelquechose at free.quelquepart.fr
Sun Mar 4 17:07:35 EST 2007


Shawn Milo wrote:
(snip)
> The script reads a file from standard input and
> finds the best record for each unique ID (piid). The best is defined
> as follows: The newest expiration date (field 5) for the record with
> the state (field 1) which matches the desired state (field 6). If
> there is no record matching the desired state, then just take the
> newest expiration date.
> 

Here's a fixed (w.r.t. the test data) version with a somewhat better (and 
faster) algorithm using Decorate/Sort/Undecorate (aka Schwartzian transform):

import sys
output = sys.stdout

input = [
     #ID  STATE ...  ...  ...  DATE TARGET
     "aaa\tAAA\t...\t...\t...\t20071212\tBBB\n",
     "aaa\tAAA\t...\t...\t...\t20070120\tAAA\n",
     "aaa\tAAA\t...\t...\t...\t20070101\tAAA\n",
     "aaa\tAAA\t...\t...\t...\t20071010\tBBB\n",
     "aaa\tAAA\t...\t...\t...\t20071111\tBBB\n",
     "ccc\tAAA\t...\t...\t...\t20071201\tBBB\n",
     "ccc\tAAA\t...\t...\t...\t20070101\tAAA\n",
     "ccc\tAAA\t...\t...\t...\t20071212\tBBB\n",
     "ccc\tAAA\t...\t...\t...\t20071212\tAAA\n",
     "bbb\tAAA\t...\t...\t...\t20070101\tBBB\n",
     "bbb\tAAA\t...\t...\t...\t20070101\tBBB\n",
     "bbb\tAAA\t...\t...\t...\t20071212\tBBB\n",
     "bbb\tAAA\t...\t...\t...\t20070612\tBBB\n",
     "bbb\tAAA\t...\t...\t...\t20071212\tBBB\n",
     ]

def find_best_match(input=input, output=output):
     PIID = 0
     STATE = 1
     EXP_DATE = 5
     DESIRED_STATE = 6

     recs = {}
     for line in input:
         line = line.rstrip('\n')
         row = line.split('\t')
         # Decorate: the key ranks state matches first, then later dates
         sort_key = (row[STATE] == row[DESIRED_STATE], row[EXP_DATE])
         recs.setdefault(row[PIID], []).append((sort_key, line))

     # Sort each group descending, then undecorate by keeping only the line
     for decorated_lines in recs.itervalues():
         print >> output, sorted(decorated_lines, reverse=True)[0][1]

Lines are sorted first on whether the state matches the desired state, 
then on the expiration date. Since it's a reverse sort, we first get the 
lines that match (if any) sorted by date descending, then the lines that 
don't match, also sorted by date descending. In both cases, the 'best 
match' is the first item in the list. Then we just have to get rid of the 
sort key, et voilà !-)
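
To see why the ordering comes out that way, here is a small sketch of the key comparison on its own. It is an illustration under assumptions, not the script above: the rows are made-up 3-field tuples (state, expiration date, desired state) rather than the real 7-field records, the name `rows` is hypothetical, and the syntax is Python 3. It also shows that `max()` with the same key picks the same record without sorting the whole group, which is a variation and not what the posted code does.

```python
# Hypothetical, simplified rows: (state, exp_date, desired_state).
rows = [
    ("AAA", "20071212", "BBB"),  # newest date, but state does not match
    ("AAA", "20070120", "AAA"),  # matches, newer of the two matches
    ("AAA", "20070101", "AAA"),  # matches, older
]

def sort_key(row):
    state, exp_date, desired_state = row
    # Tuples compare element by element and False < True, so a matching
    # state outranks any date. YYYYMMDD strings compare in date order.
    return (state == desired_state, exp_date)

# Reverse sort: matching rows first (newest first), then the others.
ranked = sorted(rows, key=sort_key, reverse=True)
best = ranked[0]

# Only the top item is needed, so max() with the same key also works.
best_via_max = max(rows, key=sort_key)

print(best)
```

The non-matching row has the newest date but still ends up last, because the boolean is compared before the date.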

HTH


