Perl and Python, a practical side-by-side example.

Fri Mar 2 20:36:19 EST 2007

Shawn Milo a écrit :
> I'm new to Python and fairly experienced in Perl, although that
> experience is limited to the things I use daily.
> 
> I wrote the same script in both Perl and Python, and the output is
> identical. The run speed is similar (very fast) and the line count is
> similar.
> 
> Now that they're both working, I was looking at the code and wondering
> what Perl-specific and Python-specific improvements to the code would
> look like, as judged by others more knowledgeable in the individual
> languages.
> 
> I am not looking for the smallest number of lines, or anything else
> that would make the code more difficult to read in six months. Just
> any instances where I'm doing something inefficiently or in a "bad"
> way.
> 
(snip)
> 
> #! /usr/bin/env python
> 
> import sys
> 
> input = sys.stdin
> 
> recs = {}
> 
> for row in input:
>     row = row.rstrip('\n')
>     piid = row.split('\t')[0]
>     if recs.has_key(piid) is False:

'is' is the identity operator - practically, in CPython, it compares 
memory addresses. You *dont* want to use it here.

Another way to test if a key is already in a dict is to use the 'in' 
operator, ie:
       if not piid in recs:

>         recs[piid] = []
>     recs[piid].append(row)

As a last point, dicts have a setdefault(key, default) method that 
returns the value associated with key if it already exists, else add the 
key=>default pair and returns default, but it's a little bit slower.

for row in input:
     row = row.rstrip('\n')
     piid = row.split('\t')[0]
     recs.setdefault(piid, []).append(row)

or

for row in input:
     row = row.rstrip('\n')
     piid = row.split('\t')[0]
     if piid not in recs:
         recs[piid] = []
     recs[piid].append(row)

> for piid in recs.keys():

You don't need to call key() here - it's already the default iteration 
for dicts.

But since you want the values too, you should use dict.items() or 
dict.iteritems():

for piid, rows in recs.iteritems()

>     best = ""
>     for current in recs[piid]:
>         if best == "":
>             best = current;

Get rid of this ";" !-)

Better to use None - it's the "no value" object, and it let you use an 
identity test:

        best = None
        for current in rows:
           if best is None:
               best = row

And while we're at it, you can save yourself a test here:

        best = row[0]
        for current in row[1:]:
           # start tests

>             #If the current record is the correct state
>             if current.split("\t")[1] == current.split("\t")[6]:
>                 #If the existing record is the correct state
>                 if best.split("\t")[1] == best.split("\t")[6]:
>                     #If the new record has a newer exp. date
>                     if current.split("\t")[5] > best.split("\t")[5]:

You're repeatingly calling split() on the same objects. This is a 
serious waste of time and CPU.

>                         best = current
>                 else:
>                     best = current
>             else:
>                 #If the existing  record does not have the correct state
>                 #and the new record has a newer exp. date
>                 if best.split("\t")[1] != best.split("\t")[6] and
> current.split("\t")[5] > best.split("\t")[5]:
>                     best = current
>            
>     print best

Here's a somewhat corrected version:
# ---
import sys

def findbests(input=sys.stdin, output=sys.stdout):
     DATE = 5
     TARGET = 6
     STATE = 1
     recs = {}

     for row in input:
         row = row.rstrip('\n').split("\t")
         piid = row[0]
         if piid not in recs:
             recs[piid] = []
         recs[piid].append(row)

     for piid, rows in recs.iteritems():
         best = rows[0]
         for current in rows[1:]:
             if current[STATE] == current[TARGET]:
                 if best[STATE] == best[TARGET]:
                     if current[DATE] > best[DATE]:
                         best = current
                 else:
                     best = current
             elif best[STATE] != best[TARGET] \
                  and current[DATE] > best[DATE]:
                 best = current

         print >> output, "\t".join(best)

if __name__ == '__main__':
     findbdest()

# ---

It's somewhat shorter, a bit more readable IMHO, and somewhat faster too 
(=~ 30/40% faster on my machine). Also, it's usable as both a program 
and an importable module, and with any line input and file-like output.

Now for the bad news: I'm afraid your algorithm is broken : here are my 
test data and results:

input = [
     #ID  STATE ...  ...  ... TARG DATE
     "aaa\tAAA\t...\t...\t...\tBBB\t20071212\n",
     "aaa\tAAA\t...\t...\t...\tAAA\t20070120\n",
     "aaa\tAAA\t...\t...\t...\tAAA\t20070101\n",
     "aaa\tAAA\t...\t...\t...\tBBB\t20071010\n",
     "aaa\tAAA\t...\t...\t...\tBBB\t20071111\n",
     "ccc\tAAA\t...\t...\t...\tBBB\t20071201\n",
     "ccc\tAAA\t...\t...\t...\tAAA\t20070101\n",
     "ccc\tAAA\t...\t...\t...\tBBB\t20071212\n",
     "ccc\tAAA\t...\t...\t...\tAAA\t20071212\n", # oops !
     "bbb\tAAA\t...\t...\t...\tAAA\t20070101\n",
     "bbb\tAAA\t...\t...\t...\tAAA\t20070101\n",
     "bbb\tAAA\t...\t...\t...\tAAA\t20071212\n",
     "bbb\tAAA\t...\t...\t...\tAAA\t20070612\n",
     "bbb\tAAA\t...\t...\t...\tBBB\t20071212\n",
     ]

=>
aaa	AAA	...	...	...	BBB	20071212
bbb	AAA	...	...	...	BBB	20071212
ccc	AAA	...	...	...	BBB	20071201

At least for piid 'ccc', there's a record with matching state and newest 
date. (exact same result as with your original version).

HTH