[Tutor] Problem When Iterating Over Large Test Files

Thu Jul 19 08:00:12 CEST 2012

On Wed, Jul 18, 2012 at 04:33:20PM -0700, Ryan Waples wrote:

> I've included 20 consecutive lines of input and output.  Each of these
> 5 'records' should have been selected and printed to the output file.

I count only 19 lines. The first group has only three lines. See below.

There is a blank line, which I take as NOT part of the input but just a 
spacer. Then:

1) Line starting with @
2) Line of bases CGCGT ...
3) Plus sign
4) Line starting with @@@
5) Line starting with @
6) Line of bases TTCTA ...
7) Plus sign

and so on. There are TWO lines before the first +, and three before each 
of the others.

> __EXAMPLE RAW DATA FILE REGION__
> 
> @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
> CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
> +
> @@@DDADDHHHHHB9+2A<??:?G9+C)???G@DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5@CDAC(5(5:5,(8?88?BC@#########
> @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
> TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
> +
> @CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4@ADCA>>?BBBDDABB055<>-?A<B1:@ACC:
> @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
> CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA
> +
> CCCFFFFFHHHHHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACC<ACDB;;B?C3>A>AD>BA
> @HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
> ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
> +
> CCCFFFFFHHHHHIDHJIIHIIIJIJIIJJJJGGIIFHJIIGGGGIIEIFHFF>CBAECBDDDC:??B=AAACD?8@:>C@?8CBDDD@D99B@>3884>A
> @HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
> CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
> +

Your code says that the first line in each group should start with an @ 
sign. That is clearly not the case for the last two groups.

I suggest that your data files have been corrupted.
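
One way to test that suspicion is to scan a file and report the first 
record that breaks the four-line layout your code expects: a header 
starting with @, the bases, a separator starting with +, and a quality 
string as long as the bases. What follows is only a rough sketch. The 
name first_bad_record and the miniature inputs are made up for 
illustration, and it assumes no blank lines and no wrapped sequences.

```python
from itertools import islice

def first_bad_record(lines):
    """Return (record_number, reason) for the first malformed record, or None.

    Assumes four lines per record: @header, bases, + separator, qualities.
    """
    iterator = iter(lines)
    number = 0
    while True:
        # Pull at most four lines at a time, so a large file is never
        # held in memory all at once.
        record = [line.strip() for line in islice(iterator, 4)]
        if not record:
            return None  # clean end of input, nothing wrong found
        number += 1
        if len(record) < 4:
            return (number, "incomplete record")
        header, bases, separator, quality = record
        if not header.startswith("@"):
            return (number, "header does not start with @")
        if not separator.startswith("+"):
            return (number, "third line does not start with +")
        if len(bases) != len(quality):
            return (number, "bases and qualities differ in length")

# Made-up miniature inputs, not taken from the real data files.
good = ["@read1 1:N:0:\n", "ACGT\n", "+\n", "IIII\n"]
truncated = good + ["@read2 1:N:0:\n", "ACGT\n", "+\n"]

print(first_bad_record(good))
print(first_bad_record(truncated))
```

With a real file you would pass the open file object instead of a list, 
since iterating over a file also yields one line at a time.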

> __PYTHON CODE __

I have re-written your code slightly, to be a little closer to "best 
practice", or at least modern practice. If there is anything you don't 
understand, please feel free to ask.

I haven't tested this code, but it should run fine on Python 2.7.

It will be interesting to see if you get different results with this.

import glob

def four_lines(file_object):
        """Yield lines from file_object grouped into batches of four.

        If the file has fewer than four lines remaining, pad the batch 
        with 1-3 empty strings.

        Lines are stripped of leading and trailing whitespace.
        """
        while True:
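            # Get the first line. If there is no first line, we are at EOF
            # and we return, which ends the generator. (Letting
            # StopIteration escape from next() inside a generator is an
            # error on Python 3.7 and later.)
            line1 = next(file_object, None)
            if line1 is None:
                return
            line1 = line1.strip()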
            # Get the next three lines, padding if needed.
            line2 = next(file_object, '').strip()
            line3 = next(file_object, '').strip()
            line4 = next(file_object, '').strip()
            yield (line1, line2, line3, line4)

my_in_files = glob.glob ('E:/PINK/Paired_End/raw/gzip/*.fastq')
for each in my_in_files:
        out = each.replace('/gzip', '/rem_clusters2' )
        print ("Reading File: " + each)
        print ("Writing File: " + out)
        INFILE = open (each, 'r')
        OUTFILE = open (out , 'w')
        writes = 0
        reads = 0  # stays 0 if the file is empty

        for reads, lines in enumerate(four_lines(INFILE), 1):
                ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines
                # Check that ID_Line_1 starts with @
                if not ID_Line_1.startswith('@'):
                        print ("**ERROR**")
                        print ("expected ID_Line to start with @")
                        print (lines)
                        print ("Read Number " + str(reads))
                        break
                # Check that ID_Line_2 (the third line) is the + separator
                elif ID_Line_2 != '+':
                        print ("**ERROR**")
                        print ("expected ID_Line_2 = +")
                        print (lines)
                        print ("Read Number " + str(reads))
                        break
                # Select Reads that I want to keep      
                ID = ID_Line_1.partition(' ')
                if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
                        # Write to file, maintaining group of 4
                        OUTFILE.write(ID_Line_1 + "\n")
                        OUTFILE.write(Seq_Line + "\n")
                        OUTFILE.write(ID_Line_2 + "\n")
                        OUTFILE.write(Quality_Line + "\n")
                        writes += 1
        # End of file reached, print update
        print ("Saw %d groups of four lines" % reads)
        print ("Wrote %d groups of four lines" % writes)
        INFILE.close()
        OUTFILE.close()
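
If you want to try the grouping and selection logic without touching 
the real files, something like the following small in-memory run may 
help. It is a sketch, not a replacement for the script above: the names 
batches_of_four and wanted are hypothetical, and the sample lines are 
invented.

```python
def batches_of_four(lines):
    """Yield stripped lines in tuples of four, padding a short final batch with ''."""
    iterator = iter(lines)
    for first in iterator:
        # next() with a default never raises, so a short tail is padded.
        rest = [next(iterator, '').strip() for _ in range(3)]
        yield (first.strip(),) + tuple(rest)

def wanted(id_line):
    """Return True if the text after the first space is 1:N:0: or 2:N:0:."""
    # partition(' ') returns (before, separator, after) for the first space.
    flags = id_line.partition(' ')[2]
    return flags in ("1:N:0:", "2:N:0:")

# Invented miniature input: two complete records and one cut short.
sample = [
    "@read1 1:N:0:\n", "ACGT\n", "+\n", "IIII\n",
    "@read2 1:Y:0:\n", "TTGA\n", "+\n", "HHHH\n",
    "@read3 2:N:0:\n", "GGCC\n",
]

for number, record in enumerate(batches_of_four(sample), 1):
    print(number, wanted(record[0]), record)
```

The third batch comes back padded with two empty strings, which is why 
the checks on the first and third lines matter: a truncated file shows 
up as an error there rather than as a silently short record.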

-- 
Steven