Question of Python second loop break and indexes

Wed May 9 05:52:33 EDT 2012

On 09.05.2012 10:36, lilin Yi wrote:
> //final_1 is a list of Identifier which I need to find corresponding
> files(four lines) in x(x is the  file) and write following four lines
> in a new file.
> 
> //because the order of the identifier is the same, so after I find the
> same identifier in x , the next time I want to start from next index
> in x, which will save time. That is to say, when the if condition is
> satisfied, it can automatically jump out of the second while loop and
> come to the next identifier of final_1 ,meanwhile the j should start
> not from the beginning but the position previous.
>
> //when I run the code it takes too much time more than one hour and
> give the wrong result....so could you help me make some improvement of
> the code?

If the code takes too much time and gives the wrong results, you must
fix and improve it. In order to do that, the first thing you should do
is get familiar with "test-driven development" and Python's unittest
library. You can start by fixing the code, but chances are that you will
break it again while trying to make it faster. Having tests that tell you
after each step that the code still works correctly is invaluable.
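
To make that concrete, here is a minimal sketch of what such a test could
look like. The function name copy_records, its signature and the sample
data are made up for illustration, and the function body is only one
plausible shape for the lookup you describe, not a definitive
implementation.

```python
import io
import unittest


def copy_records(identifiers, lines, out):
    """Hypothetical function under test: for each identifier, write the
    matching line and the three lines after it to `out`."""
    offset = 0
    for ident in identifiers:
        j = offset
        while j < len(lines):
            if lines[j] == ident:
                out.writelines(lines[j:j + 4])
                offset = j + 4  # resume after this record next time
                break
            j += 1


class CopyRecordsTest(unittest.TestCase):
    # Two made-up four-line records used by both tests.
    LINES = ["@a\n", "1\n", "2\n", "3\n",
             "@b\n", "4\n", "5\n", "6\n"]

    def test_copies_only_requested_record(self):
        out = io.StringIO()
        copy_records(["@b\n"], self.LINES, out)
        self.assertEqual(out.getvalue(), "@b\n4\n5\n6\n")

    def test_unknown_identifier_writes_nothing(self):
        out = io.StringIO()
        copy_records(["@zzz\n"], self.LINES, out)
        self.assertEqual(out.getvalue(), "")
```

Saved in a file, this can be run with "python -m unittest <file>". Writing
to an io.StringIO instead of a real file keeps the tests fast and lets you
compare the whole output in one assertion.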

Some more comments below...

> i=0
> 
> offset_1=0
> 
> 
> while i <len(final_1):
> 	j = offset_1
> 	while j <len(x1):
> 		if final_1[i] == x1[j]:
> 			new_file.write(x1[j])
> 			new_file.write(x1[j+1])
> 			new_file.write(x1[j+2])
> 			new_file.write(x1[j+3])
> 			offset_1 = j+4
> 			quit_loop="True"
> 		if quit_loop == "True":break
> 		else: j=j +1
> 	i=i+1

Just looking at the code, there are a few things to note:
1. You are iterating "i" from zero to len(final_1)-1. The pythonic way
to code this is using "for i in range(len(final_1)):...". However, since
you only use the index "i" to look up an element inside the "final_1"
sequence, the proper way is "for f in final_1:..."
2. Instead of writing four lines separately, you could write them in a
loop: "for x in x1[j:j+4]: new_file.write(x)".
3. "x1" is a list, right? In that case, there is a member function
"index()" that searches for an element and accepts an optional start
position. It raises ValueError if the element is not found.
4. The "quit_loop" is useless, and you probably are getting wrong
results because you don't reset this value. If you use "break" at the
place where you assign "True" to it, you will probably get what you
want. Also, Python has real boolean values, "True" and "False", so you
don't have to use strings.
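
Taken together, the four points above could be sketched like this. The
function name copy_with_index and the sample data are invented for the
example, and I am assuming that "x1" is a list of lines and that
"new_file" is any object with a writelines() method.

```python
import io


def copy_with_index(final_1, x1, new_file):
    """Write each identifier's four-line record, searching forward only."""
    offset = 0
    for f in final_1:                    # point 1: iterate over the elements
        try:
            j = x1.index(f, offset)      # point 3: search from `offset` on
        except ValueError:
            continue                     # identifier not found, try the next
        new_file.writelines(x1[j:j + 4])  # point 2: write the four lines
        offset = j + 4                   # no quit_loop flag needed (point 4)


# Made-up sample data: three records of four lines each.
x1 = ["@a\n", "1\n", "2\n", "3\n",
      "@b\n", "4\n", "5\n", "6\n",
      "@c\n", "7\n", "8\n", "9\n"]
out = io.StringIO()
copy_with_index(["@a\n", "@c\n"], x1, out)
print(out.getvalue(), end="")
```

Like your original, this relies on the identifiers appearing in the same
order in both sequences; an identifier that sits before the current offset
will not be found.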

Concerning the speed, you can probably improve it by not storing the
lines of the input file in "x1", but rather creating a dictionary
mapping from each identifier to the corresponding four lines:

content = open(...).readlines()
d = {}
for i in range(0, len(content), 4):
    # key: identifier line; value: the four lines of that record
    d[content[i]] = tuple(content[i:i+4])

Then, drop the "offset_1" (at least do that until you have the code
working correctly), as it doesn't work with a dictionary and the
dictionary will probably be faster anyway.

The whole loop above then becomes:

for idf in final_1:
    for line in d.get(idf, ()):  # empty tuple if the identifier is missing
        new_file.write(line)

;)
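
For reference, here is the dictionary approach as one self-contained
sketch that runs on in-memory data. The helper names build_index and
copy_from_index and the sample records are made up, and I am assuming
that every record is exactly four lines and that identifiers are unique
(a repeated identifier would overwrite the earlier entry).

```python
import io


def build_index(content):
    """Map each identifier line to the tuple of its four-line record."""
    d = {}
    for i in range(0, len(content), 4):
        d[content[i]] = tuple(content[i:i + 4])
    return d


def copy_from_index(final_1, d, new_file):
    """Write the record of every identifier that is present in `d`."""
    for idf in final_1:
        new_file.writelines(d.get(idf, ()))  # nothing written if missing


# Made-up sample data standing in for the result of readlines().
content = ["@a\n", "1\n", "2\n", "3\n",
           "@b\n", "4\n", "5\n", "6\n"]
d = build_index(content)
out = io.StringIO()
copy_from_index(["@b\n", "@a\n"], d, out)  # order no longer matters
print(out.getvalue(), end="")
```

Note that readlines() keeps the trailing newline on every line, so the
entries of "final_1" must include it as well or no key will match.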

I hope I gave you a few ideas, good luck!

Uli