[Tutor] nested lists of data
Scott Melnyk
melnyk at gmail.com
Mon Oct 18 13:31:34 CEST 2004
Hello!
I am looking at data in the form of type, subgroup, data.
It is genetic data, so I will try to make this a generic question but
include the real data so someone can help explain what I have done
wrong or suggest a better way to go about this.
A type may have one or more subgroups; each subgroup must have at least
one piece of data.
The format is:
(Although it looks here as if the exons are on separate lines, in the
data file a newline occurs only after NEW GENE, NEW TRANSCRIPT and
the last exon in a transcript, and the blank line is just a
newline.)
type subgroup data
ENSG is the gene id, ENST is the transcript id, and ENSE are the exons.
NEW GENE
ENSG00000187908.1 ENST00000339871.1 ENSE00001383339.1
ENSE00001387275.1 ENSE00001378578.1 ENSE00001379544.1
ENSE00001368222.1 ENSE00001372264.1 ENSE00001365999.1
NEW TRANSCRIPT
ENSG00000187908.1 ENST00000344399.1 ENSE00001384814.1
ENSE00001374811.1 ENSE00001391015.1 ENSE00001370692.1
ENSE00001372884.1 ENSE00001386551.1 ENSE00001386137.1
NEW TRANSCRIPT
ENSG00000187908.1 ENST00000338354.1 ENSE00001364942.1
ENSE00001379878.1 ENSE00001376065.1 ENSE00001379576.1
NEW GENE
ENSG00000129899.5 ENST00000306922.4 ENSE00001350558.2
ENSE00001316817.3 ENSE00001149607.1 ENSE00001149600.1
ENSE00001149591.1 ENSE00001149579.1 ENSE00001383071.1
ENSE00001149558.2 ENSE00001149547.1 ENSE00001302825.1
ENSE00001149539.1 ENSE00001149529.1 ENSE00001186785.1
ENSE00001350475.1 ENSE00001350469.2 ENSE00001350465.2
ENSE00001350461.2 ENSE00001350458.2 ENSE00001309288.1
ENSE00001149467.2 ENSE00001250660.4
NEW TRANSCRIPT
ENSG00000129899.5 ENST00000306944.5 ENSE00001350584.2
ENSE00001316817.3 ENSE00001149607.1 ENSE00001149600.1
ENSE00001149591.1 ENSE00001149579.1 ENSE00001383071.1
ENSE00001149558.2 ENSE00001149547.1 ENSE00001302825.1
ENSE00001149539.1 ENSE00001149529.1 ENSE00001186785.1
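One way the layout above could be read into a nested structure is sketched below. This is only an illustration, written for Python 3 rather than the Python 2.3 used in this post; the function name `parse_genes` and the dict-of-dicts layout are assumptions, not part of the original script. It accepts both the one-line-per-transcript file format and the wrapped listing shown here.

```python
def parse_genes(lines):
    """Return {gene_id: {transcript_id: [exon ids in file order]}}."""
    genes = {}
    current = None  # exon list of the transcript currently being read
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "NEW":
            # Blank line, or a NEW GENE / NEW TRANSCRIPT marker.
            current = None
            continue
        if fields[0].startswith("ENSG") and len(fields) >= 2:
            gene_id, transcript_id = fields[0], fields[1]
            current = genes.setdefault(gene_id, {}).setdefault(transcript_id, [])
            fields = fields[2:]
        if current is not None:
            # Also picks up exon-only continuation lines, as in the listing.
            current.extend(f for f in fields if f.startswith("ENSE"))
    return genes


# A few lines taken from the data above.
sample = """\
NEW GENE
ENSG00000187908.1 ENST00000338354.1 ENSE00001364942.1
ENSE00001379878.1 ENSE00001376065.1 ENSE00001379576.1
NEW GENE
ENSG00000129899.5 ENST00000306922.4 ENSE00001350558.2
""".splitlines()

genes = parse_genes(sample)
```

Keying by gene id and transcript id avoids having to track numeric positions such as a gene counter or a transcript counter.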
I would like to use sets to generate a text file of any data that is
repeated across all the transcripts of a gene. So for a given
gene, if every transcript has exons x and y, then I want to know
that.
Put another way: within a group (gene), if all subgroups (transcripts)
contain exon(s) x (y, z, or more), then write to a file a line containing
the gene id followed by all exons that are in all transcripts of that gene, e.g.
ENSG00000129899.5 ENSE00001149539.1 ENSE00001149579.1
The output file is equivalent to
group x data 3 data 4
group x data 7 data 9 data 10
etc.
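The "exons present in every transcript" step could be sketched with set intersection as follows. This is a hedged illustration in Python 3 (Python 2.3 would need `sets.Set` instead of the built-in `set`); the helper name `common_exons` and the `genes` dictionary are assumptions, and each transcript is shortened to three exons taken from the data above.

```python
def common_exons(transcripts):
    """Return the sorted exon ids present in every transcript of one gene."""
    exon_sets = [set(exons) for exons in transcripts.values()]
    if not exon_sets:
        return []
    return sorted(set.intersection(*exon_sets))


# Two transcripts of one gene, shortened to a few exons each.
genes = {
    "ENSG00000129899.5": {
        "ENST00000306922.4": [
            "ENSE00001350558.2", "ENSE00001149579.1", "ENSE00001149539.1",
        ],
        "ENST00000306944.5": [
            "ENSE00001350584.2", "ENSE00001149579.1", "ENSE00001149539.1",
        ],
    },
}

report_lines = []
for gene_id, transcripts in genes.items():
    shared = common_exons(transcripts)
    if shared:
        # One output line per gene: gene id, then the shared exons.
        report_lines.append(" ".join([gene_id] + shared))
```

Each entry of `report_lines` can then be written to the output file followed by a newline.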
Here is what I am trying. The print statements are for checking the
program as I go along;
they will be removed.
info=re.compile('^(ENSG\d+\.\d).+(ENST\d+\.\d).+(ENSE\d+\.\d)+')
#above is match gene, transcript, then one or more exons
exonArray=[]
geneflag=0
transArray=[]
AllTrans=[]
for line in TFILE:
    Matched2= info.match(line)
    if line.startswith('NEW GENE'):
        geneflag=geneflag+1
        transloop=0
    if Matched2:
        if line.startswith('ENS'):
            geneid,transid,exons=line.split(None,2)
            exonArray=exons.split()
            print "this is the gene "+geneid
            print "this is the transcript "+transid
            print "these are the exons: \n"
            for exon in exonArray:
                print exon ," ",
            print "\n"
            print transloop
            #up to here seems to be working fine by the output
            #problems here
            transArray[transloop]=exonArray
            #transArray[0]=exonArray tried this same error
            transloop=transloop+1
            AllTrans[geneflag]=transArray[transloop]
        if not line.startswith('ENS'):
            break
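A side note on the pattern: in Python's `re` module a repeated group such as `(ENSE\d+\.\d)+` keeps only its last repetition, so the match object cannot hand back every exon. The script above sidesteps this by splitting the line; another option, sketched here in Python 3 with an assumed pattern name `exon_pattern`, is `findall`.

```python
import re

# Same pattern as in the script above, written as a raw string.
info = re.compile(r'^(ENSG\d+\.\d).+(ENST\d+\.\d).+(ENSE\d+\.\d)+')
# Hypothetical helper pattern that matches one exon id at a time.
exon_pattern = re.compile(r"ENSE\d+\.\d+")

line = "ENSG00000187908.1 ENST00000338354.1 ENSE00001364942.1 ENSE00001379878.1"

last_exon_only = info.match(line).group(3)  # a single exon id, not a list
exon_ids = exon_pattern.findall(line)       # every exon id on the line
```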
when run I get:
Z:\datasets>C:\scomel\python2.3.4\python.exe Z:\scripts\MondayExonRemoval.py Mo
dayTest.txt MonOct18spam.txt
this is the gene ENSG00000187908.1
this is the transcript ENST00000339871.1
these are the exons:
ENSE00001383339.1 ENSE00001387275.1 ENSE00001378578.1 ENSE00001379544.1
ENSE00001368222.1 ENSE00001372264.1 ENSE00001365999.1 ENSE00001377564.1
ENSE00001382923.1 ENSE00001366872.1 ENSE00001372652.1 ENSE00001374822.1
ENSE00001390913.1 ENSE00001386215.1 ENSE00001378373.1 ENSE00001389805.1
ENSE00001367196.1 ENSE00001377652.1 ENSE00001375990.1 ENSE00001386225.1
0
Traceback (most recent call last):
  File "Z:\scripts\MondayExonRemoval.py", line 58, in ?
    transArray[transloop]=exonArray
IndexError: list assignment index out of range
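The error above is what Python raises when an item is assigned to a position that does not exist yet in a list; an empty list has no index 0. A minimal sketch of the behaviour, in Python 3 and with an illustrative variable name `transcripts` that is not from the script:

```python
transcripts = []  # empty list: no index exists yet

try:
    transcripts[0] = ["ENSE00001383339.1"]
except IndexError as error:
    message = str(error)  # same message as in the traceback above

# append() grows the list by one slot, after which index 0 can be assigned.
transcripts.append(["ENSE00001383339.1"])
transcripts[0] = ["ENSE00001383339.1", "ENSE00001387275.1"]
```

Under that reading, building the nested lists with `append` (or using dictionaries keyed by id) would avoid the out-of-range assignment.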
I was thinking it would be a set of nested lists:
AllTrans[0] would be the first gene group, with the index set by the geneflag number,
and the group is transArray (a list of transcripts), each of which has
the list of exons.
These nested lists are making me a bit dizzy, but I am not seeing a
clearer way to go on.
I was going to convert each list into a set and use the sets to pull
out what was the same within each.
Any suggestions and help appreciated.
--
Scott Melnyk
melnyk at gmail.com