[Tutor] nested lists of data

Mon Oct 18 13:31:34 CEST 2004

Hello!

I am looking at data in the form of type, subgroup, data.
It is genetic data, so I will try to make this a generic question but
include the real data so someone can explain what I have done
wrong or suggest a better way to go about this.

A type may have 1 or more subgroups, and each subgroup must have at
least one piece of data.

The format is:
(Although it looks here as if the exons are on separate lines, in the
data file a newline occurs only after NEW GENE, NEW TRANSCRIPT and
the last exon in a transcript, and the white-space line is just a
newline.)
type                         subgroup                    data
ENSG is gene id       ENST is transcript id   ENSE the exons       

NEW GENE
ENSG00000187908.1 ENST00000339871.1 ENSE00001383339.1
ENSE00001387275.1 ENSE00001378578.1 ENSE00001379544.1
ENSE00001368222.1 ENSE00001372264.1 ENSE00001365999.1

	NEW TRANSCRIPT
ENSG00000187908.1 ENST00000344399.1 ENSE00001384814.1
ENSE00001374811.1 ENSE00001391015.1 ENSE00001370692.1
ENSE00001372884.1 ENSE00001386551.1 ENSE00001386137.1

	NEW TRANSCRIPT
ENSG00000187908.1 ENST00000338354.1 ENSE00001364942.1
ENSE00001379878.1 ENSE00001376065.1 ENSE00001379576.1

NEW GENE
ENSG00000129899.5 ENST00000306922.4 ENSE00001350558.2
ENSE00001316817.3 ENSE00001149607.1 ENSE00001149600.1
ENSE00001149591.1 ENSE00001149579.1 ENSE00001383071.1
ENSE00001149558.2 ENSE00001149547.1 ENSE00001302825.1
ENSE00001149539.1 ENSE00001149529.1 ENSE00001186785.1
ENSE00001350475.1 ENSE00001350469.2 ENSE00001350465.2
ENSE00001350461.2 ENSE00001350458.2 ENSE00001309288.1
ENSE00001149467.2 ENSE00001250660.4

	NEW TRANSCRIPT
ENSG00000129899.5 ENST00000306944.5 ENSE00001350584.2
ENSE00001316817.3 ENSE00001149607.1 ENSE00001149600.1
ENSE00001149591.1 ENSE00001149579.1 ENSE00001383071.1
ENSE00001149558.2 ENSE00001149547.1 ENSE00001302825.1
ENSE00001149539.1 ENSE00001149529.1 ENSE00001186785.1

I would like to use sets to generate a text file of any data (exons)
that is repeated in every transcript of a gene. So for a given gene,
if each transcript has exons x and y, then I want to know that.
Put another way: within a group (gene), if all subgroups (transcripts)
contain exon(s) x (and y, z, or more), then write a line to a file
containing the gene id followed by all exons that are in all
transcripts for that gene, for example:
ENSG00000129899.5   ENSE00001149539.1 ENSE00001149579.1

The output file is equivalent to
group x data 3 data4
group x data 7 data 9 data 10
etc.
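
The step described above could be sketched roughly like this. It assumes
the data has already been grouped into a dict of gene id -> transcript id ->
list of exons, and it uses the built-in set type of a current Python (2.3
has the sets module instead). The names common_exons and format_report are
hypothetical:

```python
def common_exons(transcripts):
    """Return the exons present in every transcript of one gene."""
    exon_sets = [set(exons) for exons in transcripts.values()]
    if not exon_sets:
        return set()
    # Intersect all the per-transcript sets in one call.
    return set.intersection(*exon_sets)


def format_report(genes):
    """Build one output line per gene: gene id followed by shared exons."""
    lines = []
    for gene_id, transcripts in genes.items():
        shared = sorted(common_exons(transcripts))
        if shared:  # genes with no shared exons are left out
            lines.append(" ".join([gene_id] + shared))
    return lines


# Shortened exon lists taken from the second gene in the data above.
genes = {
    "ENSG00000129899.5": {
        "ENST00000306922.4": ["ENSE00001350558.2", "ENSE00001316817.3",
                              "ENSE00001149607.1"],
        "ENST00000306944.5": ["ENSE00001350584.2", "ENSE00001316817.3",
                              "ENSE00001149607.1"],
    },
}
report = format_report(genes)
```

The lines in report can then be written to the output file joined with
newlines. Note that under this sketch a gene with only one transcript
reports all of its exons as shared; whether that is wanted is an open
choice.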

Here is what I am trying. The print statements are for checking the
program as I go along; they will be removed.

info=re.compile('^(ENSG\d+\.\d).+(ENST\d+\.\d).+(ENSE\d+\.\d)+')
#above is match gene, transcript, then one or more exons

exonArray=[]
geneflag=0	
transArray=[]
AllTrans=[]		

for line in TFILE:
	Matched2= info.match(line)	
	if line.startswith('NEW GENE'):
		geneflag=geneflag+1
		transloop=0					
	if Matched2:
		if line.startswith('ENS'): 
			geneid,transid,exons=line.split(None,2)
			exonArray=exons.split()							
			print "this is the gene "+geneid
			print "this is the transcript "+transid
			print "these are the exons:  \n"
			for exon in exonArray:
				print exon ," ",
			print "\n"
			print transloop
			#up to here seems to be working fine by the output

			#problems here
			transArray[transloop]=exonArray
			#transArray[0]=exonArray tried this same error
			transloop=transloop+1
			AllTrans[geneflag]=transArray[transloop]
		if not line.startswith('ENS'):
			break

when run I get:

Z:\datasets>C:\scomel\python2.3.4\python.exe Z:\scripts\MondayExonRemoval.py Mo
dayTest.txt MonOct18spam.txt
this is the gene ENSG00000187908.1
this is the transcript ENST00000339871.1
these are the exons:

ENSE00001383339.1   ENSE00001387275.1   ENSE00001378578.1   ENSE00001379544.1
ENSE00001368222.1   ENSE00001372264.1   ENSE00001365999.1   ENSE00001377564.1
ENSE00001382923.1   ENSE00001366872.1   ENSE00001372652.1   ENSE00001374822.1
ENSE00001390913.1   ENSE00001386215.1   ENSE00001378373.1   ENSE00001389805.1
ENSE00001367196.1   ENSE00001377652.1   ENSE00001375990.1   ENSE00001386225.1

0
Traceback (most recent call last):
  File "Z:\scripts\MondayExonRemoval.py", line 58, in ?
    transArray[transloop]=exonArray
IndexError: list assignment index out of range 

I was thinking it would be a set of nested lists:
AllTrans[0] would be the first gene group, indexed by the geneflag number,
and each group is a transArray (list of transcripts), each of which has
its list of exons.
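
For reference, a sketch of that nested-list layout, assuming the lists are
grown with append instead of by assigning to an index (assigning to an
index that does not yet exist in a list raises IndexError). It is written
for a current Python, and the short ids and the name build_nested are made
up for illustration:

```python
def build_nested(lines):
    """Return a list of genes; each gene is a list of exon lists."""
    all_trans = []
    for line in lines:
        stripped = line.strip()
        if stripped == "NEW GENE":
            # Start a new, empty list of transcripts for this gene.
            all_trans.append([])
        elif stripped.startswith("ENSG") and all_trans:
            # Fields are gene id, transcript id, then the exons.
            exons = stripped.split()[2:]
            all_trans[-1].append(exons)
    return all_trans


# Hypothetical shortened ids, same layout as the real file.
sample = [
    "NEW GENE",
    "ENSG1 ENST1 ENSE1 ENSE2",
    "",
    "\tNEW TRANSCRIPT",
    "ENSG1 ENST2 ENSE2 ENSE3",
    "",
    "NEW GENE",
    "ENSG2 ENST3 ENSE4",
]
nested = build_nested(sample)
```

Here nested[0] is the first gene and nested[0][1] is the exon list of its
second transcript, so no geneflag or transloop counter is needed.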

these nested lists are making me a bit dizzy but I am not seeing a
clearer way to go on.

I was going to convert each list into a set and use the sets to pull
out what was the same within each.

Any suggestions and help appreciated

-- 
Scott Melnyk

melnyk at gmail.com