Regular Expression - Matching Multiples of 3 Characters exactly.

Sun Apr 27 22:31:42 EDT 2008

On Apr 27, 10:24 pm, castiro... at gmail.com wrote:
> On Apr 27, 8:31 pm, blaine <frik... at gmail.com> wrote:
>
>
>
> > Hey everyone,
> >   For the regular expression gurus...
>
> > I'm trying to write a string matching algorithm for genomic
> > sequences.  I'm pulling out Genes from a large genomic pattern, with
> > certain start and stop codons on either side.  This is simple
> > enough... for example:
>
> > start = AUG stop=AGG
> > BBBBBBAUGWWWWWWAGGBBBBBB
>
> > So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
> > This works great with my current regular expression.
>
> > The problem, however, is that codons come in sets of 3 bases.  So
> > there are actually three different 'frames' I could be using.  For
> > example:
> > ABCDEFGHIJ
> > I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
>
> > So finally, my question.  How can I represent this in a regular
> > expression? :)  This is what I'd like to do:
> > (Find all groups of any three characters) (Find a start codon) (find
> > any other codons) (Find an end codon)
>
> > Is this possible? It seems that I'd want to do something like this:
> > (\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
> > three non-whitespace characters, followed by AUG \s AGG, and then
> > anything else.  I hope I am making sense.  Obviously, however, this
> > will make sure that ANY set of three characters exist before a start
> > codon.  Is there a way to match exactly, to say something like 'Find
> > all sets of three, then AUG and AGG, etc.'.  This way, I could scan
> > for genes, remove the first letter, scan for more genes, remove the
> > first letter again, and scan for more genes.  This would
> > hypothetically yield different genes, since the frame would be
> > shifted.
>
> > This might be a lot of information... I appreciate any insight.  Thank
> > you!
> > Blaine
>
> Here's one idea (untested):
>
> s = {}
> # len(genes) - 2 start positions, so the final triple is not skipped
> for x in range(len(genes) - 2):
>     s[x] = genes[x:x + 3]
>
> You might like Python's 'string slicing' feature.

True - I could try something like that. In fact, I have a 'codon'
function that does exactly that.  The problem is that I then have to
go back and loop over the resulting list.  I'm trying to use regular
expressions so that the processing is quicker.  Running time matters
since this genomic string is pretty large.

Thanks for the suggestion though!
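
For reference, here is a rough sketch of how the frame constraint might be
written as a single pattern. It has only been checked against the toy
example above. It assumes AUG and AGG as the start and stop codons, and it
uses \w{3} as a stand-in for "any codon". The names GENE and find_genes are
made up for illustration. The lookahead lets matches overlap, so all three
frames are scanned in one pass, and the frame is just the match position
modulo 3.

```python
import re

# Assumption: AUG/AGG are the start/stop codons from the example, and
# \w{3} stands in for "any codon" (a real alphabet might be [ACGU]{3}).
# The outer lookahead (?=...) consumes nothing, so candidates may overlap
# and every start position is tried.  (?:\w{3})*? lazily consumes whole
# codons, so the stop codon must be in frame with the start codon.
GENE = re.compile(r"(?=(AUG(?:\w{3})*?AGG))")

def find_genes(seq):
    """Return (frame, start, gene) for each in-frame start..stop match."""
    return [(m.start(1) % 3, m.start(1), m.group(1))
            for m in GENE.finditer(seq)]

if __name__ == "__main__":
    for frame, start, gene in find_genes("BBBBBBAUGWWWWWWAGGBBBBBB"):
        print(frame, start, gene)
```

Because the quantifier is lazy, each start codon is paired with the first
in-frame AGG after it. A start codon nested inside another gene is reported
as its own match, so the results may need filtering depending on what
counts as a gene.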