[Tutor] searching for data in one file from another
Scott Melnyk
melnyk at gmail.com
Thu Oct 28 18:23:45 CEST 2004
Hello!
First, thanks to Rich for the great help last time I wrote in to the list.
I have a file with exon IDs, one per line, and another large file with
exon, gene, and transcript info.
I would like to remove each entry in the second file (ID information and
subsequent sequence data) whose exon ID appears in the first, but I keep
running into problems just getting the matching to work.
# format of file to remove exons from is:
>ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1
assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 72877 to
72981|exons plus upstream and downstream regions for exon
GAGATGGCAGGTGTCAGGGCCGAGTGGAGATCCTATACCGAGGCTCCTGGGGCACCGTGTGTGATGACAGCTGGGACACCAATGATGCCAACGTGGTCTGTAGGC
In the above there is a newline after "regions for exon", then a blank
line; the sequence data after that is all one long line, followed by
blank space.
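To make the goal concrete, here is a minimal sketch of dropping whole entries by exon ID. It assumes an entry runs from one line starting with ">" up to the next such line, and that the exon ID is the first field of that line. The name drop_records and the sample IDs are hypothetical, and the pattern allows multi-digit version numbers, which is an assumption beyond the sample shown above.

```python
import re

# Assumption: the exon ID is the first "|"-separated field of a ">" header line.
HEADER_ID = re.compile(r'^>(ENSE\d+\.\d+)')

def drop_records(lines, redundant_ids):
    """Yield every line except those in a record whose exon ID is redundant.

    A record is taken to run from one ">" line up to the next ">" line.
    """
    keep = True
    for line in lines:
        if line.startswith('>'):
            match = HEADER_ID.match(line)
            # Skip this record only if its ID is in the redundant set.
            keep = not (match and match.group(1) in redundant_ids)
        if keep:
            yield line

# Hypothetical two-record sample in the layout described above.
sample = [
    ">ENSE00000000001.1|ENSG00000000009.1|ENST00000000007.1\n",
    "\n",
    "GAGATGG\n",
    "\n",
    ">ENSE00000000002.1|ENSG00000000009.1|ENST00000000007.1\n",
    "\n",
    "CCTGGGG\n",
    "\n",
]
kept = list(drop_records(sample, {"ENSE00000000001.1"}))
```

Holding the redundant IDs in a set means the large file only has to be read once, however many IDs there are.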
Here are the basics of my script so far:
################################################
import re, sys, string

info=re.compile('^>(ENSE\d+\.\d).+')
info2=re.compile('(ENSE\d+\.\d)')
RFILE=open(sys.argv[1], 'r')   #list of redundant exons
R2FILE=open(sys.argv[2], 'r')  #full list of genes, transcripts and exons
#WFILE=open(sys.argv[3], 'w')  #write above file minus redundant exons; not ready for this yet
m=0
Rexons=0
for line in RFILE:  #cycle over the list of exons
    Rexons=Rexons+1  #counter for number of exons
    Matched1= info2.match(line)  #test line matches format -each line should
    Ecount=0  #counter for how many occurrences of this exon there are in big file
    if Matched1:
        RiD=Matched1.group(1)  #assign the exon to RiD (Redundant iD)
        print "\n",RiD, "this is the ",Rexons," exon to check against"  #just to watch while testing
        line2Count=0  #line counter for big file
        for line2 in R2FILE:  #iterate through the big file
            line2Count=line2Count+1
            Matched= info.match(line2)
            if Matched:
                ID=Matched.group(1)  #ID is now the grouping from above-should be the exon id
                #print ID,
                if ID==RiD:
                    Ecount=Ecount+1
                    m=m+1
                    print ID, "from line", line2Count
        print" checked total of ", line2Count," lines."
        print "There were ",Ecount," hits for this exon."
print Rexons, "redundant exons in list"
print m, "exons removed from large file"
################################################
When I run this I get:
Z:\datasets>C:\scomel\python2.3.4\python.exe ..\scripts\r
3pm.txt altsplice1.fasta
ENSE00000677348.1 this is the 1 exon to check against
ENSE00000677348.1 from line 188560
ENSE00000677348.1 from line 188656
checked total of 1156852 lines.
There were 2 hits for this exon.
ENSE00000677356.1 this is the 2 exon to check against
checked total of 0 lines.
There were 0 hits for this exon.
ENSE00000677362.1 this is the 3 exon to check against
checked total of 0 lines.
There were 0 hits for this exon.
ENSE00000677344.1 this is the 4 exon to check against
checked total of 0 lines.
There were 0 hits for this exon.
Obviously I have made one or more errors in my iterations, as it prints
"checked total of 0 lines" for every exon after the first one in the
first file.
Each of the exons in the RFILE list is only there because it occurs in
every version of the genes in the second file, so there should be hits
for each.
When it starts to run there is a pause of 7-10 seconds after printing
ENSE00000677348.1 this is the 1 exon to check against
Then everything cycles past as fast as it can be written to the screen,
and no matches are found.
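The symptom above (one slow pass, then "0 lines" for every later exon) is consistent with the big file being read to its end during the first pass: a Python file object is a one-pass iterator, so a second "for" loop over the same handle yields nothing unless the handle is rewound. A minimal sketch of that behaviour, using io.StringIO as a stand-in for the real file (an assumption, as are the short made-up IDs and the helper name lines_seen_per_pass):

```python
import io

# Hypothetical stand-in for the large file: two tiny records, four lines.
text = ">ENSE1.1|x\nACGT\n>ENSE2.1|y\nTTTT\n"
ids = ["ENSE1.1", "ENSE2.1"]

def lines_seen_per_pass(handle, ids, rewind):
    """Loop over the handle once per ID and record how many lines each pass saw."""
    counts = []
    for _ in ids:
        if rewind:
            handle.seek(0)  # go back to the start before each pass
        counts.append(sum(1 for _line in handle))
    return counts

without_rewind = lines_seen_per_pass(io.StringIO(text), ids, rewind=False)
with_rewind = lines_seen_per_pass(io.StringIO(text), ids, rewind=True)

print(without_rewind)  # [4, 0]: the second pass finds the handle exhausted
print(with_rewind)     # [4, 4]: every pass sees the whole file
```

In the script above, the equivalent change would be calling R2FILE.seek(0) just before the inner loop. That still rereads the large file once per exon, so reading the ID list into a set first and making a single pass over the large file may be the better option.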
I am stumped.
Thanks in advance for all help,
Scott
--
Scott Melnyk
melnyk at gmail.com