Regular Expression - Matching Multiples of 3 Characters exactly.

blaine frikker at gmail.com
Sun Apr 27 21:31:45 EDT 2008


Hey everyone,
  For the regular expression gurus...

I'm trying to write a string matching algorithm for genomic
sequences.  I'm pulling out Genes from a large genomic pattern, with
certain start and stop codons on either side.  This is simple
enough... for example:

start = AUG stop=AGG
BBBBBBAUGWWWWWWAGGBBBBBB

So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
This works great with my current regular expression.

The problem, however, is that codons come in sets of 3 bases.  So
there are actually three different 'frames' I could be using.  For
example:
ABCDEFGHIJ
I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.

So finally, my question.  How can I represent this in a regular
expression? :)  This is what I'd like to do:
(Find all groups of any three characters) (Find a start codon) (find
any other codons) (Find an end codon)

Is this possible? It seems that I'd want to do something like this: (\w
\w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
three non-whitespace characters, followed by AUG \s AGG, and then
anything else.  I hope I am making sense.  Obviously, however, this
will make sure that ANY set of three characters exist before a start
codon.  Is there a way to match exactly, to say something like 'Find
all sets of three, then AUG and AGG, etc.'.  This way, I could scan
for genes, remove the first letter, scan for more genes, remove the
first letter again, and scan for more genes.  This would
hypothetically yield different genes, since the frame would be
shifted.

This might be a lot of information... I appreciate any insight.  Thank
you!
Blaine



More information about the Python-list mailing list