remove last 76 letters from string

MRAB python at mrabarnett.plus.com
Wed Aug 5 20:32:22 EDT 2009


PeroMHC wrote:
> Hi All, So here is the problem... I have a FASTA file (used for DNA
> analyses) that looks like this:
> 
> ...
>> gnl|SRA|SRR019045.10.1 SL-XAY_956090708:2:1:0:1028.1 length=152
> NCTTTTTTTATTTTTTGTATAAATGAAGTTTCACTATATCGGACGAGCGGTTCAGCAGTCATTCCGAGAC
> CGATATAGTGAAACTTCATTTCTACAAAAANTACCAAACGTCGCTCGGCAGAGCGTCGTGTTGGGCAAGA
> GAGTAGCACTCG
>> gnl|SRA|SRR019045.11.1 SL-XAY_956090708:2:1:0:1151.1 length=152
> NGGTNTGGNNNNCNCCNTNCTNCNNCNTCANCCTCCNGTCNCANNCCNCNTNNNNNCNNNNNCNNTNCTT
> CTNCNNTCTCCATTCCTTCTTNATAGCCTGCTCCANCGCACGTTGAACCTTCTGCACCACGAACGCACTC
> ACACCACTCATC
>> gnl|SRA|SRR019045.12.1 SL-XAY_956090708:2:1:0:1197.1 length=152
> NGTCGGGTCTTCGCTATCACTGGACTGCTCCCATCAGCTATAGGTCCTCCCCGCCACACCCCATGCCCAC
> CGCCTATCCACGTCTGTCACAACCTCATACATCAGACAGTCACACTTACCAACATATCCAAGCACCTCAA
> GCAACACATCAT
> ...
> 
> This snippet represents 3 individual DNA sequences. Each sequence is
> identified by the line starting with >
> The complete file has about 10 million individual sequences.
> 
> A simple enough problem, I want to read in this data, and cut out the
> last 76 letters (nucleotides) from each individual sequence and send
> them to a new txt file with a similar format.
> 
> Any help on how to do this would be appreciated.
> Thanks!

If the input file is large then you can reduce the amount of memory
needed by reading it a line at a time, which is what iterating over the
file object does:

     # 'with' closes the file even if an error occurs in the loop
     with open(input_path) as input_file:
         for line in input_file:
             ...

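Each line will end with '\n', so use the 'rstrip' method to remove it.
In your sample each sequence is wrapped over several lines, so first
join the stripped lines that belong to one record (everything up to the
next line starting with '>'), and then slice the last 76 characters of
the joined string:

     # 'sequence' is the joined text of one record
     last_part = sequence[-76:]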
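
Putting the pieces together for records that wrap over several lines,
here is a minimal sketch rather than a finished tool. It assumes every
record starts with a header line beginning with '>' and that a single
record fits comfortably in memory. The names 'last_letters' and
'write_tails' are made up for this example.

```python
def last_letters(lines, count=76):
    """Yield (header, tail) for each record in an iterable of FASTA lines.

    Assumption: a record is a '>' header line followed by one or more
    sequence lines. Only one record is held in memory at a time.
    """
    header = None
    parts = []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            # A new header ends the previous record, so emit it first.
            if header is not None:
                yield header, ''.join(parts)[-count:]
            header = line
            parts = []
        elif header is not None and line:
            parts.append(line)
    # Emit the final record, which has no following header.
    if header is not None:
        yield header, ''.join(parts)[-count:]


def write_tails(input_path, output_path, count=76):
    """Write each header followed by the last 'count' letters of its sequence."""
    with open(input_path) as input_file:
        with open(output_path, 'w') as output_file:
            for header, tail in last_letters(input_file, count):
                output_file.write(header + '\n' + tail + '\n')
```

You would call it as write_tails('reads.fasta', 'tails.fasta'), where
both file names are placeholders. The header lines are copied unchanged,
so they will still say length=152, and each tail is written on a single
line instead of being re-wrapped.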


