[Tutor] Code optimisation

Sat Apr 5 09:44:00 CEST 2008

"yogi" <byogi at yahoo.com> wrote

> #/bin/python
> import sys, os, csv, re
> x =  0                                  #Define Zero for now
> var = 1000000                           #Taking the variation
> # This programme finds the SNPs from the range passed
> # csv splits columns and this file is tab spaced
> fis = csv.reader(open("divs.map", "rb"), delimiter='\t', 
> quoting=csv.QUOTE_NONE)
> for row in fis:
> # csv splits columns and this file is ","  spaced
>        gvalues = csv.reader(open("genvalues", "rb"), delimiter=',', 
> quoting=csv.QUOTE_NONE)

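Move this outside the loop, otherwise you re-open and re-read the
file for every line in the other file - slow! Note that a csv.reader
can only be iterated over once, so read its rows into a list before
the outer loop and loop over that list instead.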
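
A minimal sketch of that change, written for Python 3 (the original script is Python 2) and using in-memory sample data in place of divs.map and genvalues. The sample rows and the load_rows helper name are made up for illustration:

```python
import csv
import io

def load_rows(handle, delimiter):
    """Read every row of a delimited file into a list, once."""
    return list(csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE))

# Made-up stand-ins for the two input files.
divs_map = io.StringIO("chr1\tsnp1\t0\t1500\nchr2\tsnp2\t0\t9000\n")
genvalues = io.StringIO("chr1,1000,2000,0,0\nchr2,100,200,0,0\n")

gvalues = load_rows(genvalues, ",")      # read once, before the outer loop
pairs_checked = 0
for row in csv.reader(divs_map, delimiter="\t", quoting=csv.QUOTE_NONE):
    for gvalue in gvalues:               # reuses the list, no file I/O here
        pairs_checked += 1
print(pairs_checked)
```

The inner loop still visits every pair of rows, but the second file is now parsed only once.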

>        for gvalue in gvalues:
> # To see  Columns (chr) Match
>                if row[0] == gvalue[0]:
> # If  Column 3  (range) is Zero  print row
>                        if int(gvalue[3]) ==  x:
>                                a = int(gvalue[1]) - var
>                                b = int(gvalue[2]) + var + 1
>                                if int(a <= int(row[3]) <= b):
>                                        print   row

I'd probably use names like 'lo' and 'hi' instead of 'a'
and 'b', but that's a nitpick. More importantly, you don't want
to convert the result of the test to an int: the result is a
boolean and you never use the int you create, so it's just
wasted processing power...

> # If  Column 3  (range) is not zero find matches and print row
>                        else:
>                                a = int(gvalue[1]) - var
>                                b = int(gvalue[2]) + var + 1

Repeated code: you could move this above the if test
since it's used by both conditions. That's easier to maintain
if you change the rules...

>                                if int(a <= int(row[3]) <= b):

>                                        print row

Again, you don't need the int() conversion.

>                                        c = int(gvalue[3]) - var
>                                        d = int(gvalue[4]) + var + 1
>                                        if int(c <= int(row[3]) <= 
> d):

And again. You do this so often that I'd consider making it a
helper function:

def inLimits(lower, upper, val):
    # 'var' is the module-level margin defined at the top of the script
    lo = int(lower) - var
    hi = int(upper) + var + 1
    return lo <= int(val) <= hi

Your else clause then becomes:

        else:
            if inLimits(gvalue[1], gvalue[2], row[3]):
                print row
                if inLimits(gvalue[3], gvalue[4], row[3]):
                    print row

Which is slightly more readable I think.
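
For reference, a self-contained sketch of the helper that can be run on its own. Passing the margin in as a 'var' argument with a default, rather than reading the global, is an assumption of this sketch and not something the original script does. The sample values are made up.

```python
def inLimits(lower, upper, val, var=1000000):
    """True if val lies within [lower - var, upper + var + 1]."""
    lo = int(lower) - var
    hi = int(upper) + var + 1
    return lo <= int(val) <= hi

# Fields arrive from the csv module as strings, so int() is applied inside.
print(inLimits("5000000", "6000000", "4500000"))   # inside the widened range
print(inLimits("5000000", "6000000", "3999999"))   # just below the lower bound
```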

> Question1 : Is there a better way ?

There's always a better way.
As a general rule, for processing large volumes of data
I tend to go for a SQL database. But that's mainly based
on my experience that if you want to do one lot of queries
you'll eventually want to do more - and SQL is designed
for doing queries on large datasets; Python isn't (although
Python can do SQL...).
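
As a rough illustration of the SQL route, here is a sketch using the sqlite3 module from the standard library with an in-memory database. The table names, column names and sample rows are all invented for the example, and only the first range (columns 1 and 2 of genvalues) is checked:

```python
import sqlite3

VAR = 1000000  # same margin as 'var' in the script

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snps (chrom TEXT, name TEXT, pos INTEGER)")
conn.execute("CREATE TABLE genes (chrom TEXT, range_lo INTEGER, range_hi INTEGER)")
conn.executemany(
    "INSERT INTO snps VALUES (?, ?, ?)",
    [("chr1", "snp1", 1500), ("chr1", "snp2", 5000000), ("chr2", "snp3", 1500)],
)
conn.executemany("INSERT INTO genes VALUES (?, ?, ?)", [("chr1", 1000, 2000)])

# One join replaces the nested loops; the database does the matching.
query = """
    SELECT s.chrom, s.name, s.pos
    FROM snps AS s
    JOIN genes AS g ON s.chrom = g.chrom
    WHERE s.pos BETWEEN g.range_lo - ? AND g.range_hi + ? + 1
    ORDER BY s.name
"""
matches = conn.execute(query, (VAR, VAR)).fetchall()
for match in matches:
    print(match)
conn.close()
```

With real data you would load the two files into the tables once and then run as many different queries as you like against them.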

> Question2 : For now I'm using shells time  call  for calculating
> time required. Does Python provide a more fine grained check.

Try the timeit module...
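
For example, a small sketch that times the range test with and without the redundant int() wrapper. The statements being timed are only illustrative, and the numbers will vary from machine to machine:

```python
import timeit

setup = "a, b, val = 4000000, 7000001, '4500000'"

# timeit.timeit runs the statement 'number' times and returns total seconds.
with_int = timeit.timeit("int(a <= int(val) <= b)", setup=setup, number=100000)
without_int = timeit.timeit("a <= int(val) <= b", setup=setup, number=100000)

print("with int():    %.4f s" % with_int)
print("without int(): %.4f s" % without_int)
```

The same module can also be run from the shell with python -m timeit "statement".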

> Question 2: If I have convert this code into a function.
> Should I ?

Only if you need to reuse it in a bigger context
or if you want to parameterize it. You could maybe break
it out into smaller helper functions such as the one I
suggested above.

HTH,

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld