Regular Expression - Matching Multiples of 3 Characters exactly.

castironpi at gmail.com castironpi at gmail.com
Sun Apr 27 22:24:49 EDT 2008


On Apr 27, 8:31 pm, blaine <frik... at gmail.com> wrote:
> Hey everyone,
>   For the regular expression gurus...
>
> I'm trying to write a string matching algorithm for genomic
> sequences.  I'm pulling out Genes from a large genomic pattern, with
> certain start and stop codons on either side.  This is simple
> enough... for example:
>
> start = AUG stop=AGG
> BBBBBBAUGWWWWWWAGGBBBBBB
>
> So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
> This works great with my current regular expression.
>
> The problem, however, is that codons come in sets of 3 bases.  So
> there are actually three different 'frames' I could be using.  For
> example:
> ABCDEFGHIJ
> I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
>
> So finally, my question.  How can I represent this in a regular
> expression? :)  This is what I'd like to do:
> (Find all groups of any three characters) (Find a start codon) (find
> any other codons) (Find an end codon)
>
> Is this possible? It seems that I'd want to do something like this:
> (\w\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
> three non-whitespace characters, followed by AUG \s AGG, and then
> anything else.  I hope I am making sense.  Obviously, however, this
> will make sure that ANY set of three characters exist before a start
> codon.  Is there a way to match exactly, to say something like 'Find
> all sets of three, then AUG and AGG, etc.'.  This way, I could scan
> for genes, remove the first letter, scan for more genes, remove the
> first letter again, and scan for more genes.  This would
> hypothetically yield different genes, since the frame would be
> shifted.
>
> This might be a lot of information... I appreciate any insight.  Thank
> you!
> Blaine

Here's one idea (untested):

# Map each start index to the 3-character slice that begins there.
s = {}
for x in range(len(genes) - 2):
    s[x] = genes[x:x + 3]
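
The loop above collects every overlapping three-character slice. If what you want is the non-overlapping codons of each reading frame, something like the following might do. It is only a sketch: the names codons_in_frame and find_genes are made up for illustration, and it assumes a gene is simply the first start codon in a frame followed by the next stop codon in that same frame.

```python
def codons_in_frame(genes, frame):
    """Split genes into non-overlapping 3-letter codons, starting at
    offset frame (0, 1 or 2). A trailing partial codon is dropped."""
    end = len(genes) - (len(genes) - frame) % 3
    return [genes[i:i + 3] for i in range(frame, end, 3)]


def find_genes(genes, start="AUG", stop="AGG"):
    """Return (frame, gene) pairs, scanning each of the three frames."""
    found = []
    for frame in range(3):
        codons = codons_in_frame(genes, frame)
        begin = None  # index of the currently open start codon, if any
        for idx, codon in enumerate(codons):
            if begin is None and codon == start:
                begin = idx
            elif begin is not None and codon == stop:
                found.append((frame, "".join(codons[begin:idx + 1])))
                begin = None
    return found


print(find_genes("BBBBBBAUGWWWWWWAGGBBBBBB"))
```

Because each frame is scanned separately, there is no need to strip the first letter and rescan by hand.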

You might like Python's 'string slicing' feature.
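
If you would rather stay with a regular expression, here is a rough sketch using the standard re module. The idea is to require a whole number of three-character groups between the start and stop codons, so the stop is always in the same frame as the start. Wrapping the pattern in a lookahead lets finditer try every position, and the match offset modulo 3 tells you the frame. The names GENE and genes_by_frame are my own, and \w{3} is an assumption about what counts as a base; swap in [ACGU]{3} if that fits your data better.

```python
import re

# Lookahead so matches starting at every position are reported, including
# ones that overlap. (?:\w{3})*? lazily consumes whole codons, so the first
# in-frame AGG after the AUG ends the gene.
GENE = re.compile(r"(?=(AUG(?:\w{3})*?AGG))")


def genes_by_frame(sequence):
    """Return (frame, gene) pairs, where frame is the start offset mod 3."""
    return [(m.start() % 3, m.group(1)) for m in GENE.finditer(sequence)]


print(genes_by_frame("BBBBBBAUGWWWWWWAGGBBBBBB"))
print(genes_by_frame("BAUGWAGGWWAGG"))
```

In the second call the AGG at offset 5 is out of frame relative to the AUG at offset 1, so it is skipped and the gene runs to the AGG at offset 10. Note that a start codon sitting inside another gene will be reported as its own match.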


