[Fwd: Re: [Tutor] searching for data in one file from another]

Kent Johnson kent_johnson at skillsoft.com
Fri Nov 5 15:39:00 CET 2004


Rich,

When you read f2 with readlines(), each line keeps its trailing newline, so 
the exon ID taken from f (which has no newline) will never match an entry 
in the list. Also, since you are doing many membership tests on the list, a 
set would probably be faster. I suggest you try something like this to 
create 'list':

from sets import Set   # Python 2.3; later Pythons have a built-in set type
list = Set()
for line in open(exons_to_delete):
    list.add(line.strip())   # strip() drops the trailing newline
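
To see why the original test never matches, here is a small sketch. It makes 
a few assumptions: it runs on a current Python (3.x, with the built-in set 
instead of sets.Set), and io.StringIO stands in for the real Exonlist.txt. 
The exon IDs are borrowed from the sample data quoted below.

```python
import io

# Stand-in for Exonlist.txt: one exon ID per line (assumed layout).
exon_file = io.StringIO("ENSE00001383339.1\nENSE00001378578.1\n")

raw_lines = exon_file.readlines()      # every entry keeps its trailing '\n'

header = ">ENSE00001383339.1|ENSG00000187908.1|ENST00000339871.1"
exon = header[1:].split('|')[0]        # 'ENSE00001383339.1', no newline

# Fails: 'ENSE00001383339.1' is not equal to 'ENSE00001383339.1\n'.
found_raw = exon in raw_lines

# Stripping each line before storing it makes the comparison succeed.
stripped = set(line.strip() for line in raw_lines)
found_stripped = exon in stripped

print(found_raw, found_stripped)       # False True
```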

The rest of the program stays the same, including the test 'if exon in list'.

You might want to use a different name for 'list' though, since it hides 
the built-in list type.
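
Putting it together, the whole filter might look roughly like the sketch 
below. It is only an illustration for a current Python (3.x): the names 
delete_exons and exons_to_skip are made up, the function takes any iterable 
of lines rather than file names, and the tiny FASTA data is invented so the 
example runs without the real files.

```python
import io

def delete_exons(fasta_lines, exon_id_lines):
    """Yield lines from fasta_lines, skipping records whose exon ID is listed."""
    # Strip whitespace so the IDs compare equal to the header field,
    # and ignore blank lines in the ID list.
    exons_to_skip = set(line.strip() for line in exon_id_lines if line.strip())
    exon = None
    for line in fasta_lines:
        if line.startswith('>'):
            # The exon ID is the first '|'-separated field of the header.
            exon = line[1:].split('|')[0]
        if exon in exons_to_skip:
            continue                   # drop the header and its sequence lines
        yield line

# Invented two-record example; open files work the same way.
fasta = io.StringIO(
    ">EXON_A|GENE_1|TRANSCRIPT_1\nACGT\nTTGA\n"
    ">EXON_B|GENE_1|TRANSCRIPT_1\nGGCC\n"
)
to_delete = io.StringIO("EXON_A\n")

kept = list(delete_exons(fasta, to_delete))
print(kept)   # ['>EXON_B|GENE_1|TRANSCRIPT_1\n', 'GGCC\n']
```

With real files you would pass open(fname2) and open(exons_to_delete) and 
write each yielded line to the output file.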

Kent

At 09:16 AM 11/5/2004 -0500, Rich Krauter wrote:
>import sys,string
>WFILE=open(sys.argv[1], 'w')
>def deleteExons(fname2='Z:/datasets/altsplice1.fasta',exons_to_delete='Z:/datasets/Exonlist.txt'):
>     f = open(fname2)
>     f2 = open(exons_to_delete)
>     list = f2.readlines()
>     exon = None
>     for line in f:
>         if line.startswith('>'):
>             exon = line[1:].split('|')[0]
>         if exon in list:
>             continue
>         yield line
>
>
>if __name__ == '__main__':
>         for line in deleteExons():
>                 print >> WFILE, line,
>
>exonlist is made from the last program you helped me with and consists
>of single lines of exons
>
>altsplice1.fasta is 85583 kb
>When I run the program it does not shrink the file at all. In fact,
>although the first and last 40 lines appear to be the same, the
>output file is larger than the original.
>
>It is a normal FASTA file:
>
>
>ENSE00001383339.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 57203 to 57283|exons plus upstream and downstream regions for exon
>ACCCAGCAAAATGGGGATCTCCACAGTCATCCTTGAAATGTGTCTTTTATGGGGACAAGTTCTATCTACAGGTATTACGT
>T
>
>
>ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 72877 to 72981|exons plus upstream and downstream regions for exon
>GAGATGGCAGGTGTCAGGGCCGAGTGGAGATCCTATACCGAGGCTCCTGGGGCACCGTGTGTGATGACAGCTGGGACACC
>AATGATGCCAACGTGGTCTGTAGGC
>
>
>ENSE00001378578.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 82505 to 82835|exons plus upstream and downstream regions for exon
>CTGAATCCAGTTTGGCCCTGAGGCTGGTGAATGGAGGTGACAGGTGTCAGGGCCGAGTGGAGGTCCTATACCGAGGCTCC
>TGGGGCACCGTGTGTGATGACAGCTGGGACACCAATGATGCCAATGTGGTCTGCAGGCAGCTGGGCTGTGGCTGGGCCAT
>GTTGGCCCCAGGAAATGCCCGGTTTGGTCAGGGCTCAGGACCCATTGTCCTGGATGACGTGCGCTGCTCAGGGAATGAGT
>CCTACTTGTGGAGCTGCCCCCACAATGGCTGGCTCTCCCATAACTGTGGCCATAGTGAAGACGCTGGTGTCATCTGCTCA
>GGTGGGCCTCC
>
>
>ENSE00001379544.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 88623 to 89087|exons plus upstream and downstream regions for exon
>
>Any thoughts?
>
>Scott