[Tutor] searching for data in one file from another

Thu Oct 28 18:23:45 CEST 2004

Hello!

First, thanks to Rich for the great help last time I wrote in to the list.

I have a file with exon IDs, one per line, and another large file with
exon, gene and transcript info.

I would like to remove from the second file each entry (ID information
and the sequence data that follows it) whose exon ID appears in the
first, but I keep running into problems just getting the matching to work.

# format of file to remove exons from is:

>ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1
assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 72877 to 72981|exons plus upstream and downstream regions for exon

GAGATGGCAGGTGTCAGGGCCGAGTGGAGATCCTATACCGAGGCTCCTGGGGCACCGTGTGTGATGACAGCTGGGACACCAATGATGCCAACGTGGTCTGTAGGC

In the above there is a newline after "regions for exon", then a blank
line. The sequence data after that is all one long line, followed by
blank space.
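
To make the layout concrete, here is a rough sketch of how one record could be read. It is not part of my script: read_records and HEADER are hypothetical names, the sequence is shortened, io.StringIO stands in for the real file, and it assumes Python 3 and that every record starts with a line beginning with ">".

```python
import io
import re

# Hypothetical pattern: same exon-ID format my script looks for.
HEADER = re.compile(r'^>(ENSE\d+\.\d)')

def read_records(lines):
    """Yield (exon_id, record_lines) for each '>' header and the lines after it."""
    exon_id, record = None, []
    for line in lines:
        found = HEADER.match(line)
        if found:
            # A new header closes the previous record, if there was one.
            if exon_id is not None:
                yield exon_id, record
            exon_id, record = found.group(1), [line]
        elif exon_id is not None:
            record.append(line)
    if exon_id is not None:
        yield exon_id, record

# Stand-in for the big file, with a shortened sequence line.
sample = io.StringIO(
    ">ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1\n"
    "assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 72877 to 72981|"
    "exons plus upstream and downstream regions for exon\n"
    "\n"
    "GAGATGG\n"
    "\n"
)
records = list(read_records(sample))
```

Here records holds a single entry whose ID is the exon ID from the ">" line and whose lines include the description, the blank lines and the sequence.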

Here are the basics of my script so far:
################################################
import re, sys, string

info=re.compile('^>(ENSE\d+\.\d).+')
info2=re.compile('(ENSE\d+\.\d)')

RFILE=open(sys.argv[1], 'r')			#list of redundant exons
R2FILE=open(sys.argv[2], 'r')		       #full list of genes,transcripts and exons
#WFILE=open(sys.argv[3], 'w')		     #write above file minus redundant exons; not ready for this yet

m=0
Rexons=0

for line in RFILE:				      #cycle over the list of exons
	Rexons=Rexons+1				#counter for number of exons
	Matched1= info2.match(line)	       #test line matches format -each line should
	Ecount=0					#counter for how many occurrences of this exon there are in big file

	if Matched1:
		RiD=Matched1.group(1)		#assign the exon to RiD (Redundant iD)
		print "\n",RiD, "this is the ",Rexons," exon to check against" #just to watch while testing
		line2Count=0				#line counter for big file
		for line2 in R2FILE:			#iterate through the big file
			line2Count=line2Count+1
			Matched= info.match(line2)	
			if Matched:
				ID=Matched.group(1)   	#ID is now the grouping from above-should be the exon id
				#print ID,
				if ID==RiD:
					Ecount=Ecount+1
					m=m+1
					print ID, "from line", line2Count
		print" checked total of ", line2Count," lines."
	print "There were ",Ecount," hits for this exon."

print Rexons, "redundant exons in list"
print m, "exons removed from large file"

################################################
When I run this I get:

Z:\datasets>C:\scomel\python2.3.4\python.exe ..\scripts\r
3pm.txt altsplice1.fasta

ENSE00000677348.1 this is the  1  exon to check against
ENSE00000677348.1 from line 188560
ENSE00000677348.1 from line 188656
 checked total of  1156852  lines.
There were  2  hits for this exon.

ENSE00000677356.1 this is the  2  exon to check against
 checked total of  0  lines.
There were  0  hits for this exon.

ENSE00000677362.1 this is the  3  exon to check against
 checked total of  0  lines.
There were  0  hits for this exon.

ENSE00000677344.1 this is the  4  exon to check against
 checked total of  0  lines.
There were  0  hits for this exon.

Obviously I have made one or more errors in my iterations, as it prints
"checked total of 0 lines" for every exon after the first one in the
first file.
Each of the exons in the list RFILE is only there because it occurs in
each version of the genes in the second list, so there should be hits
for each.

When it starts to run there is a pause of 7-10 seconds after printing
"ENSE00000677348.1 this is the  1  exon to check against".
Then everything cycles past as fast as it can be written to the screen,
and no more matches are found.
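
The same pattern shows up in a much smaller sketch, with io.StringIO objects standing in for the two files (made-up IDs, Python 3 assumed): the inner loop sees every line on the first outer pass and none afterwards. My guess is that the second file object is simply left at its end after the first pass, but I have not confirmed that this is all that is wrong with the script.

```python
import io

# Stand-ins for the two files (made-up contents, for illustration only).
id_file = io.StringIO("ENSE1.1\nENSE2.1\n")
big_file = io.StringIO(">ENSE1.1|a\n>ENSE2.1|b\n")

lines_seen = []
for wanted in id_file:
    count = 0
    # big_file is never reopened or rewound between outer passes.
    for line in big_file:
        count += 1
    lines_seen.append(count)

# lines_seen is now [2, 0]: two lines on the first pass, none on the second.
```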

I am stumped.

Thanks in advance for all help,
Scott
-- 
Scott Melnyk

melnyk at gmail.com