processing the genetic code with python?

James Stroud jstroud at ucla.edu
Mon Mar 6 15:53:25 EST 2006


nuttydevil wrote:
> I have many notepad documents that all contain long chunks of genetic
> code. They look something like this:
> 
> atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
> tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
> agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
> ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa
> 
> Basically, I want to design a program using python that can open and
> read these documents. However, I want them to be read 3 base pairs at a
> time (to analyse them codon by codon) and look up the value assigned
> to each codon. An example of this is below:
> 
> ** If the three base pairs were UUU the value assigned to it (from the
> codon value table) would be 0.296
> 
> The program has to read all the sequence three pairs at a time, then I
> want to get all the values for each codon, multiply them together and
> put them to the power of 1 / the length of the sequence in codons
> (which is the length of the whole sequence divided by three).
> 
> However, to make things even more complicated, the Notepad sequences
> are in lowercase and the codon value table is in uppercase, so the
> sequences need to be converted into uppercase. Also, the Ts in the DNA
> sequences need to be changed to Us (again to match the codon value
> table). And finally, before the DNA sequences are read and analysed I
> need to remove the first 50 codons (i.e. the first 150 letters) and the
> last 20 codons (the last 60 letters) from the DNA sequence. I've also
> been having problems ensuring the program reads ALL the sequence 3
> letters at a time.
> 
> I've tried various ways of doing this but keep coming unstuck along the
> way. Has anyone got any suggestions for how they would tackle this
> problem?

Yes: use python.

> Thanks for any help received!
> 

I couldn't help myself. I strongly suggest you study this example. It 
will cut your coding time way down in the future.

I'm writing your name down and this is the last time I'm doing homework 
for you.

James


from functools import reduce
from operator import mul

table = { 'AUG' : 0.98999, 'CCC' : 0.9755 } # <== you fill this in
filename = 'sequence.txt'                   # <== placeholder; use your file
trim_front = 50
trim_back = 20

# Why I did this:
# Python >=1 line per thought; you have to love it
data = "".join([s.strip() for s in open(filename)])
data = data.upper().replace('T', 'U')
codons = [data[i:i+3] for i in range(0, len(data), 3)]  # Alex Martelli
trimmed = codons[trim_front:-trim_back]
# multiply over the trimmed codons only, not the full list
product = reduce(mul, [table[codon] for codon in trimmed])
value = product**(1.0/len(trimmed))  # <== is this really ALL codons?

print(value)       # useless print statement
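
One caveat about the product above: multiplying a few thousand values
that are all below 1 can underflow to 0.0 in floating point, and the
root of 0.0 is still 0.0. Below is a minimal sketch of the same
calculation done with logarithms instead, which gives the same
geometric mean without the underflow. The function name
geometric_mean_codon_value and the toy table values are made up for
illustration and are not part of the original question. The sketch also
assumes every table value is greater than zero (math.log fails
otherwise), and it raises an error when the sequence is not a whole
number of codons, which is one way to check that ALL of it is being
read three letters at a time.

```python
import math

def geometric_mean_codon_value(sequence, table, trim_front=50, trim_back=20):
    """Geometric mean of per-codon table values, computed via logarithms.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Drop whitespace/newlines, uppercase, and convert DNA to RNA letters.
    rna = "".join(sequence.split()).upper().replace("T", "U")
    # Refuse sequences that do not divide evenly into codons.
    if len(rna) % 3 != 0:
        raise ValueError("sequence length %d is not a multiple of 3" % len(rna))
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    if trim_front + trim_back >= len(codons):
        raise ValueError("nothing left after trimming")
    # Explicit end index so that trim_back=0 keeps the tail
    # (codons[trim_front:-0] would be an empty list).
    trimmed = codons[trim_front:len(codons) - trim_back]
    # Sum of logs divided by the count is the log of the geometric mean.
    log_sum = sum(math.log(table[codon]) for codon in trimmed)
    return math.exp(log_sum / len(trimmed))

# Toy usage with made-up values: four codons, one trimmed from each end.
toy_table = {"AUG": 0.5, "CCC": 0.125}
print(geometric_mean_codon_value("atgccc\natgccc", toy_table,
                                 trim_front=1, trim_back=1))
```

The toy call leaves the codons CCC and AUG, so the result should be
approximately 0.25, the square root of 0.5 * 0.125. A codon missing
from the table raises a KeyError here, just as it does in the snippet
above.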


-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/


