[Tutor] Increase performance of the script

Peter Otten __peter__ at web.de
Sun Dec 9 15:17:53 EST 2018


Asad wrote:

> Hi All ,
> 
>           I have the following code to search for an error and print the
> solution .
> 
> /A/B/file1.log size may vary from 5MB -5 GB
> 
> f4 = open (r" /A/B/file1.log  ", 'r' )
> string2=f4.readlines()

Do not read the complete file into memory. Read one line at a time and keep 
only those lines around that you may have to look at again.
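
A minimal sketch of the idea, assuming only that the log is a plain text 
file. Iterating over a file object yields one line at a time, so memory use 
stays roughly constant however large the file is. The name count_matching 
and the sample lines are made up for illustration; io.StringIO merely stands 
in for open(filename).

```python
from __future__ import print_function

import io


def count_matching(lines, needle):
    """Count the lines containing needle without storing any of them."""
    count = 0
    for line in lines:  # fetches one line at a time from the iterable
        if needle in line:
            count += 1
    return count


# Hypothetical sample data; a real script would pass an open file instead.
sample = io.StringIO(
    u"Calling rdbms/admin/a.sql\n"
    u"foo\n"
    u"Calling rdbms/admin/b.sql\n"
)
print(count_matching(sample, "Calling rdbms/admin"))  # prints 2
```

With a real file the call would look like count_matching(open(filename), 
"Calling rdbms/admin"), ideally inside a with statement.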

> for i in range(len(string2)):
>     position=i
>     lastposition =position+1
>     while True:
>          if re.search('Calling rdbms/admin',string2[lastposition]):
>           break
>          elif lastposition==len(string2)-1:
>           break
>          else:
>           lastposition += 1

You are trying to find a group of lines. The way you do it, for a file with 
the structure

foo
bar
baz
end-of-group-1
ham
spam
end-of-group-2

you find the groups

foo
bar
baz
end-of-group-1

bar
baz
end-of-group-1

baz
end-of-group-1

ham
spam
end-of-group-2

spam
end-of-group-2

That looks like a lot of redundancy which you can probably avoid. But 
wait...


>     errorcheck=string2[position:lastposition]
>     for i in range ( len ( errorcheck ) ):
>         if re.search ( r'"error(.)*13?"', errorcheck[i] ):
>             print "Reason of error \n", errorcheck[i]
>             print "script \n" , string2[position]
>             print "block of code \n"
>             print errorcheck[i-3]
>             print errorcheck[i-2]
>             print errorcheck[i-1]
>             print errorcheck[i]
>             print "Solution :\n"
>             print "Verify the list of objects belonging to Database "
>             break
>     else:
>         continue
>     break

you throw away almost all of that hard work and only look for a single line 
within the group? It looks like you only need the "error...13" line, the 
three lines that precede it, and the last "Calling..." line occurring before 
the "error...13".
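
To make the redundancy in the foo/bar/baz illustration concrete, here is a 
rough sketch that builds the groups both ways and counts how many lines each 
approach touches. It follows the simplified illustration, not the quoted 
script line by line, and the function names are invented for this example.

```python
from __future__ import print_function


def is_end(line):
    return line.startswith("end-of-group")


def groups_by_rescanning(lines):
    """For every start line, scan forward again to the next end marker."""
    groups = []
    for start, line in enumerate(lines):
        if is_end(line):
            continue
        group = []
        for candidate in lines[start:]:
            group.append(candidate)
            if is_end(candidate):
                break
        groups.append(group)
    return groups


def groups_in_one_pass(lines):
    """Visit every line once and close a group at each end marker.

    Trailing lines that are not followed by an end marker are dropped.
    """
    groups, group = [], []
    for line in lines:
        group.append(line)
        if is_end(line):
            groups.append(group)
            group = []
    return groups


lines = ["foo", "bar", "baz", "end-of-group-1",
         "ham", "spam", "end-of-group-2"]

rescanned = groups_by_rescanning(lines)
single = groups_in_one_pass(lines)
print(len(rescanned), sum(len(g) for g in rescanned))  # 5 groups, 14 lines
print(len(single), sum(len(g) for g in single))        # 2 groups, 7 lines
```

For seven lines the difference is small, but the rescanning variant grows 
with the square of the group length, which matters for groups of thousands 
of lines.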

> The problem I am facing in performance issue it takes some minutes to
> print out the solution . Please advice if there can be performance
> enhancements to this script .

If you want to learn the Python way you should try hard to write your 
scripts without a single

for i in range(...):
    ...

loop. This style is usually the last resort. It may work for small datasets, 
but as soon as you have to deal with large files performance dives.
Even worse, these loops tend to make your code hard to debug.
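
As an illustration, here are three common index-loop patterns with 
index-free counterparts. This is only a sketch; the sample data is invented.

```python
from __future__ import print_function

lines = ["alpha", "beta", "gamma"]

# Instead of: for i in range(len(lines)): print(lines[i])
for line in lines:
    print(line)

# When the position is needed as well, enumerate() supplies it.
for number, line in enumerate(lines, 1):
    print(number, line)

# Adjacent pairs (lines[i-1], lines[i]) without index arithmetic.
pairs = list(zip(lines, lines[1:]))
print(pairs)  # [('alpha', 'beta'), ('beta', 'gamma')]
```

One example of the debugging hazard: an expression like errorcheck[i-3] 
does not necessarily fail when i is less than 3. The negative index can 
silently wrap around and pick a line from the end of the list.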

Below is a suggestion for an implementation of what your code seems to be 
doing that only remembers the four most recent lines and works with a single 
loop. If that saves you some time use that time to clean the scripts you 
have lying around from occurrences of "for i in range(....): ..." ;)


from __future__ import print_function

import re
import sys
from collections import deque


def show(prompt, *values):
    print(prompt)
    for value in values:
        print(" {}".format(value.rstrip("\n")))


def process(filename):
    tail = deque(maxlen=4)  # the last four lines
    script = None
    with open(filename) as instream:
        for line in instream:
            tail.append(line)
            if "Calling rdbms/admin" in line:
                script = line
            elif re.search('"error(.)*13?"', line) is not None:
                show("Reason of error:", tail[-1])
                # script is still None if no "Calling..." line came first
                show("Script:", script or "(none seen)")
                show("Block of code:", *tail)
                show(
                    "Solution",
                    "Verify the list of objects belonging to Database"
                )
                break


if __name__ == "__main__":
    filename = sys.argv[1]
    process(filename)
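
In case deque(maxlen=...) is unfamiliar, the following small sketch shows 
the behaviour the script above relies on: once the deque is full, appending 
a new item discards the oldest one. The sample items are made up.

```python
from __future__ import print_function

from collections import deque

tail = deque(maxlen=4)
for item in "a b c d e f g".split():
    tail.append(item)  # once four items are held, the oldest drops out

print(list(tail))  # ['d', 'e', 'f', 'g']
print(tail[-1])    # g
```

In process() this means tail[-1] is the matching line and the remaining 
entries are the lines just before it. Near the start of the file the deque 
simply holds fewer than four lines.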



