Perl and Python, a practical side-by-side example.
Bruno Desthuilliers
bdesth.quelquechose at free.quelquepart.fr
Fri Mar 2 20:36:19 EST 2007
Shawn Milo a écrit :
> I'm new to Python and fairly experienced in Perl, although that
> experience is limited to the things I use daily.
>
> I wrote the same script in both Perl and Python, and the output is
> identical. The run speed is similar (very fast) and the line count is
> similar.
>
> Now that they're both working, I was looking at the code and wondering
> what Perl-specific and Python-specific improvements to the code would
> look like, as judged by others more knowledgeable in the individual
> languages.
>
> I am not looking for the smallest number of lines, or anything else
> that would make the code more difficult to read in six months. Just
> any instances where I'm doing something inefficiently or in a "bad"
> way.
>
(snip)
>
> #! /usr/bin/env python
>
> import sys
>
> input = sys.stdin
>
> recs = {}
>
> for row in input:
> row = row.rstrip('\n')
> piid = row.split('\t')[0]
> if recs.has_key(piid) is False:
'is' is the identity operator - practically, in CPython, it compares
memory addresses. You *dont* want to use it here.
Another way to test if a key is already in a dict is to use the 'in'
operator, ie:
if not piid in recs:
> recs[piid] = []
> recs[piid].append(row)
As a last point, dicts have a setdefault(key, default) method that
returns the value associated with key if it already exists, else add the
key=>default pair and returns default, but it's a little bit slower.
for row in input:
row = row.rstrip('\n')
piid = row.split('\t')[0]
recs.setdefault(piid, []).append(row)
or
for row in input:
row = row.rstrip('\n')
piid = row.split('\t')[0]
if piid not in recs:
recs[piid] = []
recs[piid].append(row)
> for piid in recs.keys():
You don't need to call key() here - it's already the default iteration
for dicts.
But since you want the values too, you should use dict.items() or
dict.iteritems():
for piid, rows in recs.iteritems()
> best = ""
> for current in recs[piid]:
> if best == "":
> best = current;
Get rid of this ";" !-)
Better to use None - it's the "no value" object, and it let you use an
identity test:
best = None
for current in rows:
if best is None:
best = row
And while we're at it, you can save yourself a test here:
best = row[0]
for current in row[1:]:
# start tests
> #If the current record is the correct state
> if current.split("\t")[1] == current.split("\t")[6]:
> #If the existing record is the correct state
> if best.split("\t")[1] == best.split("\t")[6]:
> #If the new record has a newer exp. date
> if current.split("\t")[5] > best.split("\t")[5]:
You're repeatingly calling split() on the same objects. This is a
serious waste of time and CPU.
> best = current
> else:
> best = current
> else:
> #If the existing record does not have the correct state
> #and the new record has a newer exp. date
> if best.split("\t")[1] != best.split("\t")[6] and
> current.split("\t")[5] > best.split("\t")[5]:
> best = current
>
> print best
Here's a somewhat corrected version:
# ---
import sys
def findbests(input=sys.stdin, output=sys.stdout):
DATE = 5
TARGET = 6
STATE = 1
recs = {}
for row in input:
row = row.rstrip('\n').split("\t")
piid = row[0]
if piid not in recs:
recs[piid] = []
recs[piid].append(row)
for piid, rows in recs.iteritems():
best = rows[0]
for current in rows[1:]:
if current[STATE] == current[TARGET]:
if best[STATE] == best[TARGET]:
if current[DATE] > best[DATE]:
best = current
else:
best = current
elif best[STATE] != best[TARGET] \
and current[DATE] > best[DATE]:
best = current
print >> output, "\t".join(best)
if __name__ == '__main__':
findbdest()
# ---
It's somewhat shorter, a bit more readable IMHO, and somewhat faster too
(=~ 30/40% faster on my machine). Also, it's usable as both a program
and an importable module, and with any line input and file-like output.
Now for the bad news: I'm afraid your algorithm is broken : here are my
test data and results:
input = [
#ID STATE ... ... ... TARG DATE
"aaa\tAAA\t...\t...\t...\tBBB\t20071212\n",
"aaa\tAAA\t...\t...\t...\tAAA\t20070120\n",
"aaa\tAAA\t...\t...\t...\tAAA\t20070101\n",
"aaa\tAAA\t...\t...\t...\tBBB\t20071010\n",
"aaa\tAAA\t...\t...\t...\tBBB\t20071111\n",
"ccc\tAAA\t...\t...\t...\tBBB\t20071201\n",
"ccc\tAAA\t...\t...\t...\tAAA\t20070101\n",
"ccc\tAAA\t...\t...\t...\tBBB\t20071212\n",
"ccc\tAAA\t...\t...\t...\tAAA\t20071212\n", # oops !
"bbb\tAAA\t...\t...\t...\tAAA\t20070101\n",
"bbb\tAAA\t...\t...\t...\tAAA\t20070101\n",
"bbb\tAAA\t...\t...\t...\tAAA\t20071212\n",
"bbb\tAAA\t...\t...\t...\tAAA\t20070612\n",
"bbb\tAAA\t...\t...\t...\tBBB\t20071212\n",
]
=>
aaa AAA ... ... ... BBB 20071212
bbb AAA ... ... ... BBB 20071212
ccc AAA ... ... ... BBB 20071201
At least for piid 'ccc', there's a record with matching state and newest
date. (exact same result as with your original version).
HTH
More information about the Python-list
mailing list