processing the genetic code with python?
David E. Konerding DSD staff
dek at bosshog.lbl.gov
Mon Mar 6 12:47:00 EST 2006
In article <1141657404.593850.151930 at e56g2000cwe.googlegroups.com>, nuttydevil wrote:
> I have many notepad documents that all contain long chunks of genetic
> code. They look something like this:
>
> atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
> tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
> agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
> ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa
>
> Basically, I want to design a program using python that can open and
> read these documents. However, I want them to be read 3 base pairs at a
> time (to analyse them codon by codon) and find the value that each
> codon has a value assigned to it. An example of this is below:
>
> ** If the three base pairs were UUU the value assigned to it (from the
> codon value table) would be 0.296
>
> The program has to read all the sequence three pairs at a time, then I
> want to get all the values for each codon, multiply them together and
> put them to the power of 1 / the length of the sequence in codons
> (which is the length of the whole sequence divided by three).
>
I don't really understand precisely what you're trying to do.
First off, those aren't base pairs, they're bases. Only when you have double-stranded
DNA (or RNA, or some other oddball cases) would they be base pairs.
Second, I don't know what the codon-to-value function is. Is it a frequency (i.e. n occurrences of codon
X out of N total codons)? Or is the lookup table provided for you?
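If it does turn out to be a frequency, the counting could look something like the sketch below. The function name codon_frequencies and the example list are made up for illustration, and I'm assuming the codons have already been split out into a list of strings.

```python
from collections import Counter

def codon_frequencies(codons):
    """Return {codon: count / total} for a list of codon strings."""
    counts = Counter(codons)
    total = float(len(codons))  # float so the division is not truncated
    return dict((codon, n / total) for codon, n in counts.items())

# Made-up example input, not taken from the files in question
example = ["AUG", "GCU", "AAA", "GCU"]
freqs = codon_frequencies(example)
print(freqs["GCU"])
```

If the table is handed to you instead, none of this is needed and a plain dictionary lookup will do.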
Anyway, I can help you with most of the preprocessing. For example,
> However, to make things even more complicated, the notebook sequences
> are in lowercase and the codon value table is in uppercase, so the
> sequences need to be converted into uppercase. Also, the Ts in the DNA
> sequences need to be changed to Us (again to match the codon value
> table). And finally, before the DNA sequences are read and analysed I
> need to remove the first 50 codons (i.e. the first 150 letters) and the
> last 20 codons (the last 60 letters) from the DNA sequence. I've also
> been having problems ensuring the program reads ALL the sequence 3
> letters at a time.
So, if the file is called "notepad.txt", I'd do what you described above as:
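o = open("notepad.txt")
l = o.readlines()                  ## read all lines
o.close()
l = [line.strip() for line in l]   ## strip newlines and surrounding whitespace
l = "".join(l)                     ## join into one string (in case codon boundaries cross lines)
l = l[150:-60]                     ## drop the first 50 codons (150 letters) and the last 20 (60 letters)
l = l.upper()
l = l.replace("T", "U")            ## DNA -> RNA letters, to match the codon value table
print(l)
codons = []
for i in range(0, len(l), 3):      ## step through the sequence three letters at a time
    codons.append(l[i:i+3])
print(codons)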
That gets you about 30% of the way there.
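For the remaining part, here is a rough sketch of the calculation as I understand it from your description: look up each codon's value, multiply the values together, and take the Nth root, where N is the number of codons. The table below is a placeholder (only the UUU value of 0.296 comes from your post; the other entries are made up), and the name geometric_mean_value is just something I picked. I sum logarithms instead of multiplying directly, since a product of many values below 1 can underflow to zero on a long sequence; mathematically it gives the same number.

```python
import math

# Placeholder table: only the UUU value (0.296) comes from the original
# post; the other entries are made-up numbers for illustration.
CODON_VALUES = {"UUU": 0.296, "AUG": 1.0, "GCU": 0.5}

def geometric_mean_value(codons, table):
    """Multiply the per-codon values and take the len(codons)-th root.

    Computed as exp(mean(log(v))) so long sequences do not underflow.
    Raises KeyError for a codon that is missing from the table.
    """
    if not codons:
        raise ValueError("no codons to score")
    log_sum = sum(math.log(table[c]) for c in codons)
    return math.exp(log_sum / len(codons))

# Made-up three-codon input, just to show the call
score = geometric_mean_value(["AUG", "GCU", "GCU"], CODON_VALUES)
print(score)
```

Two things to watch for with real data: a sequence whose length is not a multiple of three leaves a short chunk at the end that will not be in the table, and a table value of zero makes math.log fail, so both cases need a decision on your side.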
Dave