[Tutor] Problem When Iterating Over Large Test Files
Steven D'Aprano
steve at pearwood.info
Thu Jul 19 08:00:12 CEST 2012
On Wed, Jul 18, 2012 at 04:33:20PM -0700, Ryan Waples wrote:
> I've included 20 consecutive lines of input and output. Each of these
> 5 'records' should have been selected and printed to the output file.
I count only 19 lines. The first group has only three lines. See below.
There is a blank line, which I take as NOT part of the input but just a
spacer. Then:
1) Line starting with @
2) Line of bases CGCGT ...
3) Plus sign
4) Line starting with @@@
5) Line starting with @
6) Line of bases TTCTA ...
7) Plus sign
and so on. There are TWO lines before the first +, and three before each
of the others.
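If you want to repeat that count on a larger slice of the file, here is a minimal sketch that tallies how many lines come before each plus sign. The function name lines_before_each_plus and the sample lines are placeholders made up for illustration; they are not taken from your script or data.

```python
def lines_before_each_plus(lines):
    """For each '+' line, count the lines seen since the previous '+'."""
    counts = []
    run = 0
    for line in lines:
        if line.strip() == '+':
            counts.append(run)
            run = 0
        else:
            run += 1
    return counts


# Placeholder lines laid out like the numbered list above (not real data).
sample = ['@id1', 'CGCGT', '+', '@@@DD', '@id2', 'TTCTA', '+']
print(lines_before_each_plus(sample))  # prints [2, 3]
```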
> __EXAMPLE RAW DATA FILE REGION__
>
> @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
> CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
> +
> @@@DDADDHHHHHB9+2A<??:?G9+C)???G@DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5@CDAC(5(5:5,(8?88?BC@#########
> @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
> TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
> +
> @CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4@ADCA>>?BBBDDABB055<>-?A<B1:@ACC:
> @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
> CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA
> +
> CCCFFFFFHHHHHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACC<ACDB;;B?C3>A>AD>BA
> @HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
> ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
> +
> CCCFFFFFHHHHHIDHJIIHIIIJIJIIJJJJGGIIFHJIIGGGGIIEIFHFF>CBAECBDDDC:??B=AAACD?8@:>C@?8CBDDD@D99B@>3884>A
> @HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
> CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
> +
Your code says that the first line in each group should start with an @
sign. That is clearly not the case for the last two groups.
I suggest that your data files have been corrupted.
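One quick sanity check for a damaged file is to count its lines: a file made only of complete four-line groups must have a line count divisible by four. The sketch below assumes nothing about your real files; leftover_lines is a hypothetical helper, and the demonstration runs on a throwaway temporary file holding 19 placeholder lines.

```python
import os
import tempfile


def leftover_lines(path, group_size=4):
    """Return how many lines remain after splitting into full groups."""
    with open(path) as handle:
        total = sum(1 for _ in handle)
    return total % group_size


# Demonstration on a temporary file of 19 placeholder lines.
fd, demo_path = tempfile.mkstemp(suffix='.fastq')
with os.fdopen(fd, 'w') as handle:
    handle.write('x\n' * 19)
remainder = leftover_lines(demo_path)
os.remove(demo_path)
print(remainder)  # 19 lines leave 3 over after four full groups
```

A non-zero result only says that something is missing or extra; it does not say where.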
> __PYTHON CODE __
I have re-written your code slightly, to be a little closer to "best
practice", or at least modern practice. If there is anything you don't
understand, please feel free to ask.
I haven't tested this code, but it should run fine on Python 2.7.
It will be interesting to see if you get different results with this.
from __future__ import print_function  # print() works the same on 2.7 and 3

import glob


def four_lines(file_object):
    """Yield lines from file_object grouped into batches of four.

    If the file has fewer than four lines remaining, pad the batch
    with 1-3 empty strings.

    Lines are stripped of leading and trailing whitespace.
    """
    while True:
        # Get the first line. If there is no first line, we are at EOF
        # and we return to indicate we are done.
        try:
            line1 = next(file_object).strip()
        except StopIteration:
            return
        # Get the next three lines, padding if needed.
        line2 = next(file_object, '').strip()
        line3 = next(file_object, '').strip()
        line4 = next(file_object, '').strip()
        yield (line1, line2, line3, line4)


my_in_files = glob.glob('E:/PINK/Paired_End/raw/gzip/*.fastq')

for each in my_in_files:
    out = each.replace('/gzip', '/rem_clusters2')
    print("Reading File: " + each)
    print("Writing File: " + out)
    INFILE = open(each, 'r')
    OUTFILE = open(out, 'w')
    reads = 0   # stays 0 if the input file is empty
    writes = 0
    # enumerate() numbers the groups from 1 so we can report positions.
    for reads, lines in enumerate(four_lines(INFILE), 1):
        ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines
        # Check that ID_Line_1 starts with @ and that the third line
        # of the group (ID_Line_2) is the + separator.
        if not ID_Line_1.startswith('@'):
            print("**ERROR**")
            print("expected ID_Line to start with @")
            print(lines)
            print("Read Number " + str(reads))
            break
        elif ID_Line_2 != '+':
            print("**ERROR**")
            print("expected ID_Line_2 = +")
            print(lines)
            print("Read Number " + str(reads))
            break
        # Select Reads that I want to keep
        ID = ID_Line_1.partition(' ')
        if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
            # Write to file, maintaining group of 4
            OUTFILE.write(ID_Line_1 + "\n")
            OUTFILE.write(Seq_Line + "\n")
            OUTFILE.write(ID_Line_2 + "\n")
            OUTFILE.write(Quality_Line + "\n")
            writes += 1
    # End of file reached (or an error stopped us), print update
    print("Saw", reads, "groups of four lines")
    print("Wrote", writes, "groups of four lines")
    INFILE.close()
    OUTFILE.close()
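To see how the grouping and padding behave without touching your real data, here is a small self-contained sketch that feeds five made-up lines through a copy of four_lines. The in-memory file and its contents are assumptions for illustration only. This copy returns when the input runs out instead of letting StopIteration escape, since that would be an error inside a generator on Python 3.7 and later.

```python
import io


def four_lines(file_object):
    """Yield stripped lines in batches of four, padding with ''."""
    while True:
        try:
            line1 = next(file_object).strip()
        except StopIteration:
            return  # no first line left: we are at EOF
        line2 = next(file_object, '').strip()
        line3 = next(file_object, '').strip()
        line4 = next(file_object, '').strip()
        yield (line1, line2, line3, line4)


# Five placeholder lines: one complete group plus one leftover line.
fake_file = io.StringIO(u"@id1\nACGT\n+\nIIII\n@id2\n")
groups = list(four_lines(fake_file))
for number, group in enumerate(groups, 1):
    print(number, group)
```

The second group should come back as the leftover ID line followed by three empty strings, which is how a truncated file would show up in the checks above.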
--
Steven