processing the genetic code with python?

Mon Mar 6 12:47:00 EST 2006

In article <1141657404.593850.151930 at e56g2000cwe.googlegroups.com>, nuttydevil wrote:
> I have many notepad documents that all contain long chunks of genetic
> code. They look something like this:
> 
> atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
> tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
> agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
> ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa
> 
> Basically, I want to design a program using python that can open and
> read these documents. However, I want them to be read 3 base pairs at a
> time (to analyse them codon by codon) and find the value that each
> codon has a value assigned to it. An example of this is below:
> 
> ** If the three base pairs were UUU the value assigned to it (from the
> codon value table) would be 0.296
> 
> The program has to read all the sequence three pairs at a time, then I
> want to get all the values for each codon, multiply them together and
> put them to the power of 1 / the length of the sequence in codons
> (which is the length of the whole sequence divided by three).
> 

I don't really understand precisely what you're trying to do.  

First off, those aren't base pairs, they're bases.  Only when you have double-stranded
DNA (or RNA, or some other oddball cases) would they be base pairs.

Second, I don't know what the codon-to-value function is. Is it a frequency (i.e., n occurrences of codon
X out of N total codons)?  Or is the lookup table provided for you?
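
If it is a frequency, a minimal sketch of that calculation could look like the following. The function name codon_frequencies and the sample list are my own, purely for illustration; they are not from your post.

```python
from collections import Counter

def codon_frequencies(codons):
    """Map each codon to n / N: its count divided by the total number of codons."""
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Made-up sample: AUG appears 2 times out of 4 codons.
freqs = codon_frequencies(["AUG", "GCU", "AUG", "UUU"])
print(freqs["AUG"])  # 0.5
```

If the table is handed to you instead, none of this is needed and a plain dictionary lookup will do.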

Anyway, I can help you with most of the preprocessing.  For example,

> However, to make things even more complicated, the notebook sequences
> are in lowercase and the codon value table is in uppercase, so the
> sequences need to be converted into uppercase. Also, the Ts in the DNA
> sequences need to be changed to Us (again to match the codon value
> table). And finally, before the DNA sequences are read and analysed I
> need to remove the first 50 codons (i.e. the first 150 letters) and the
> last 20 codons (the last 60 letters) from the DNA sequence. I've also
> been having problems ensuring the program reads ALL the sequence 3
> letters at a time.

So, if the file is called "notepad.txt", I'd do what you described above as:

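## read all lines, strip newlines, and join into one string
## (in case codon boundaries cross lines)
with open("notepad.txt") as f:
    seq = "".join(line.strip() for line in f)

seq = seq[150:-60]   ## drop first 50 codons (150 letters) and last 20 codons (60 letters)
seq = seq.upper()    ## match the uppercase codon table
seq = seq.replace("T", "U")   ## DNA T -> RNA U, again to match the table
print(seq)

## split into codons, three letters at a time
codons = []
for i in range(0, len(seq), 3):
    codons.append(seq[i:i+3])

print(codons)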

That gets you about 30% of the way there.
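
For the remaining step you describe (multiply the codon values together and raise the product to the power of 1 / the number of codons), here is a rough sketch. Only the UUU value of 0.296 comes from your post. The other table entries, the name CODON_VALUES, and the function geometric_mean_score are placeholders I made up. I sum logarithms rather than multiplying directly, which gives the same result but avoids underflow on long sequences. That is my choice, not something you asked for.

```python
import math

# Hypothetical table: only UUU = 0.296 is from the original post;
# the other entries are placeholder values for illustration.
CODON_VALUES = {"UUU": 0.296, "AUG": 1.0, "GCU": 0.5}

def geometric_mean_score(codons, table):
    """Return (product of codon values) ** (1 / number of codons)."""
    # Skip an incomplete trailing codon (sequence length not a multiple of 3).
    full = [c for c in codons if len(c) == 3]
    if not full:
        raise ValueError("no complete codons")
    # Summing logs is equivalent to multiplying the values,
    # but does not underflow when there are many values below 1.
    log_sum = sum(math.log(table[c]) for c in full)
    return math.exp(log_sum / len(full))

print(geometric_mean_score(["AUG", "UUU", "GCU"], CODON_VALUES))
```

A codon missing from the table raises KeyError, and a value of zero makes math.log fail, so you would need to decide how to treat those cases (stop codons, for instance).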

Dave