[Baypiggies] string to list question

Zachary Collins recursive.cookie.jar at gmail.com
Fri Aug 6 06:59:22 CEST 2010


Speaking of which, if you need to break it up into groups of three or
anything else, you can very easily change the regex I gave to accommodate
this, or any other grammar you can conceive.  I gave my
solution not as a definitive biology solution, but as a "hey, here's the
simplest, most Pythonic way to solve parsing like that" that can be
redefined for your domain.

So if you want groups of 3 within that grammar, you could try:
>>> import re
>>> z = "ATC/GACTGAGC/TAG"
>>> [s.group() for s in re.finditer("([A-Z]/[A-Z]|[A-Z]){3}", z)]
['ATC/G', 'ACT', 'GAG', 'C/TAG']
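
If the group size needs to vary, the same pattern can be built from a
parameter.  This is only a rough sketch: split_groups and SYMBOL are names
made up for illustration, not anything from the standard library.

```python
import re

# One symbol is either an "X/Y" pair or a plain letter; the pair is tried
# first so that "C/G" is never split into "C" and a stray "/G".
SYMBOL = "[A-Z]/[A-Z]|[A-Z]"

def split_groups(seq, size=3):
    """Return consecutive groups of `size` symbols found in seq."""
    pattern = "(?:%s){%d}" % (SYMBOL, size)
    return [m.group() for m in re.finditer(pattern, seq)]

print(split_groups("ATC/GACTGAGC/TAG"))  # ['ATC/G', 'ACT', 'GAG', 'C/TAG']
print(split_groups("AT/CG", size=1))     # ['A', 'T/C', 'G']
```

One thing to watch: symbols left over at the end that do not fill a whole
group are dropped silently, so split_groups("ATCG") gives just ['ATC'].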

And yes, if you want to restrict the domain, you can change A-Z to
whatever characters are actually valid in the domain (ACGT for DNA
bases, for example).  This is just a quick, took-me-5-seconds approach
that gives you the general direction for solving whatever
string-parsing problems you might think of.
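
For instance, here is a sketch restricted to the four DNA bases.  It also
rejects input containing anything else (finditer on its own would just skip
over stray characters) and expands a codon into its alternatives, as
described in the quoted message below.  The names split_codons and
expand_codon are hypothetical, and re.fullmatch assumes Python 3.4 or later.

```python
import re
from itertools import product

BASE = "[ACGT]/[ACGT]|[ACGT]"   # an "X/Y" SNP pair, or a single base
CODON = "(?:%s){3}" % BASE      # three such symbols in a row

def split_codons(seq):
    """Split seq into codons, raising ValueError on anything unexpected."""
    # The whole string must consist of complete codons and nothing else.
    if re.fullmatch("(?:%s)*" % CODON, seq) is None:
        raise ValueError("not a whole number of codons over ACGT: %r" % seq)
    return [m.group() for m in re.finditer(CODON, seq)]

def expand_codon(codon):
    """Turn a codon such as 'ATC/G' into its plain alternatives."""
    # Each symbol contributes one choice, or two for an "X/Y" pair.
    choices = [sym.split("/") for sym in re.findall(BASE, codon)]
    return ["".join(combo) for combo in product(*choices)]

print(split_codons("ATC/GACTGAGC/TAG"))  # ['ATC/G', 'ACT', 'GAG', 'C/TAG']
print(expand_codon("ATC/G"))             # ['ATC', 'ATG']
```

Looking the expanded codons up in a genetic-code table is left out here.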

The Python library loves you :]

2010/8/6 Vikram K <kpguy1975 at gmail.com>:
> Hi Glen,
> thanks for your response. I am afraid i did not present the problem with
> clarity in my original query. The more generalized query is what if:
>
> z = 'ATC/GACTGAGC/TAG'
>
> and  i want
> zlist = ['ATC/G','ACT','GAG','C/TAG']
>
> The biology behind this is not quite what you have understood. Here is the
> problem for you and others interested (i am simplifying this as much as i can
> since i don't know your biological background):
> 'C/G' and 'C/T' are SNPs (single nucleotide polymorphisms, which can be
> thought of simply as 'change') in a particular genome being studied when
> compared to the NCBI reference genome. A specific nucleotide (say 'A') is
> being represented by two alternative nucleotides (say 'A/G') in the genome
> being investigated. The alternative nucleotides could occur because at that
> position there is a difference in the coding and complementary DNA strands
> (think of this as a difference between the paternal and maternal DNA strands
> at that position).
>
> When i take the exon regions of a gene (that are making proteins)  in the
> genome being studied i need to break up the dna string corresponding to the
> exon region in groups of  three to get the codons and then find the
> corresponding amino acid sequence using the genetic code. In doing this i
> want something like 'A/G' to be taken as a single character. 'ATC/G' will
> then correspond to two alternative amino acids, corresponding to ATC and
> ATG. [ATG (DNA) corresponds to AUG (mRNA).]
>
> On Thu, Aug 5, 2010 at 9:31 PM, Glen Jarvis <glen at glenjarvis.com> wrote:
>>
>> Vikram,
>>     I recognize this domain in many of the questions that have been asked.
>> There are several times where I've thought, "That *so* isn't the most ideal
>> 'Computer Science' way to do something." But, I also recognize that,
>> especially in the Biological world, we have no control how we receive the
>> data and thus, we still have to solve problems like those reviewed.
>>    So, I normally don't challenge the base assumption in the question
>> because I know from experience, we don't always get the most ideal inputs to
>> work with. HOWEVER, I do want to challenge this one because I know there's a
>> standard way that this is represented in the Biological community without
>> using three characters for a single base. I recognize your original question
>> of z = 'AT/CG' to mean, in biological terms, that:
>> "Zee equals the string of three nucleotide bases. The first base is
>> Adenine. The second base is either Thymine or Cytosine. The third base is
>> Guanine."
>> There's a *much* better (and commonly accepted) way to represent this.
>> The way this is traditionally represented is with the extended
>> genetic alphabet (http://www.hrbc-genomics.net/training/bcd/Curric/PrwAli/node7.html).
>> In this case, the middle base would be represented by the letter Y as that
>> means either Thymine or Cytosine.
>> I feel it's much better to represent this as:
>> z = 'AYG'
>> Then, the string will work without any extra manipulations. I would
>> always work with the alphabet and not put the three-character string back in,
>> as this alphabet is defined and accepted in the community. However, if one
>> wanted to, they could still later represent this with a 'lookup dictionary'
>> such as the following if the output ever needed to be in the format in question.
>> lookup = {'R': 'G/A',
>>               'Y': 'T/C',
>>               'M': 'A/C',....}
>> Cheers,
>>
>> Glen
>>
>> On Wed, Aug 4, 2010 at 9:37 PM, Vikram K <kpguy1975 at gmail.com> wrote:
>>>
>>> Suppose i have this string:
>>> z = 'AT/CG'
>>>
>>> How do i get this list:
>>>
>>> zlist = ['A','T/C','G']
>>>
>>>
>>> _______________________________________________
>>> Baypiggies mailing list
>>> Baypiggies at python.org
>>> To change your subscription options or unsubscribe:
>>> http://mail.python.org/mailman/listinfo/baypiggies
>>
>>
>>
>> --
>> Whatever you can do or imagine, begin it;
>> boldness has beauty, magic, and power in it.
>>
>> -- Goethe

