searching substrings with interpositions

Andrew Dalke dalke at dalkescientific.com
Tue May 24 12:04:43 EDT 2005


borges2003xx at yahoo.it wrote:
> the next step of my job is to put limits on the length of interposed
> sequences (if someone can help me with this i'll appreciate it a lot)
> thanks everyone.

Kent Johnson had the right approach, with regular expressions.
For a bit of optimization, use non-greedy quantifiers (".*?"
instead of ".*").  That will give you shorter matches.
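
Here is a small sketch of the difference, using a made-up target
string ("atcgatcg" is only an illustration, not data from this
thread).  The greedy pattern runs to the last possible "g", while
the non-greedy one stops at the first.

```python
import re

# Hypothetical target: the query "atcg" occurs twice, back to back.
target = "atcgatcg"

greedy = re.compile("a.*t.*c.*g")      # each .* grabs as much as it can
lazy = re.compile("a.*?t.*?c.*?g")     # each .*? grabs as little as it can

print(greedy.search(target).group(0))  # atcgatcg
print(lazy.search(target).group(0))    # atcg
```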

Suppose you want no more than 10 bases between terms.  You could
use this pattern.

    a.{,10}?t.{,10}?c.{,10}?g.{,10}?


>>> import re
>>> pat = re.compile('a.{,10}?t.{,10}?c.{,10}?g.{,10}?')
>>> m = pat.search("tcgaacccgtaaaaagctaatcg")
>>> m.group(0), m.start(0), m.end(0)
('aacccgtaaaaagctaatcg', 3, 23)
>>> 

>>> pat.search("tcgaacccgtaaaaagctaatttttttg")
<_sre.SRE_Match object at 0x9b950>
>>> pat.search("tcgaacccgtaaaaagctaattttttttg")
>>> 
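
To see why one more "t" breaks the match, it may help to count the
characters between the last "c" and the final "g".  The sketch below
does that for the two strings above; the helper name gap_and_match
is mine, and the code is written for Python 3.

```python
import re

pat = re.compile("a.{,10}?t.{,10}?c.{,10}?g")
prefix = "tcgaacccgtaaaaagctaa"

def gap_and_match(n_t):
    """Return (gap between last c and final g, whether pat matches)."""
    target = prefix + "t" * n_t + "g"
    gap = target.rindex("g") - target.rindex("c") - 1
    return gap, pat.search(target) is not None

print(gap_and_match(7))   # (10, True): a gap of 10 is still allowed
print(gap_and_match(8))   # (11, False): a gap of 11 is over the limit
```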

If you want to know the location of each of the bases, and
you'll have fewer than 100 of them (I think that's the limit
on the number of groups), then you can use groups in the
regular expression language:

>>> def make_pattern(s, limit = None):
...     if limit is None:
...         t = ".*?"
...     else:
...         t = ".{,%d}?" % (limit,)
...     text = []
...     for c in s:
...         text.append("(%s)%s" % (c, t))
...     return "".join(text)
... 
>>> make_pattern("atcg")
'(a).*?(t).*?(c).*?(g).*?'
>>> make_pattern("atcg", 10)
'(a).{,10}?(t).{,10}?(c).{,10}?(g).{,10}?'
>>> pat = re.compile(make_pattern("atcg", 10))
>>> m = pat.search("tcgaacccgtaaaaagctaatttttttg")
>>> m
<_sre.SRE_Match object at 0x8ea70>
>>> m.groups()
('a', 't', 'c', 'g')
>>> for i in range(1, len("atcg")+1):
...   print m.group(i), m.start(i), m.end(i)
... 
a 3 4
t 9 10
c 16 17
g 27 28
>>> 
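
For anyone who wants this as a reusable function, here is a rough
Python 3 sketch of the same idea.  The names make_gap_pattern and
find_base_positions are made up for this example, and re.escape is
only there as a precaution in case the query ever contains a regex
metacharacter.  It leaves off the trailing gap, which never consumes
anything when it is non-greedy.

```python
import re

def make_gap_pattern(query, limit=None):
    """One group per base, separated by a non-greedy gap."""
    gap = ".*?" if limit is None else ".{0,%d}?" % limit
    return gap.join("(%s)" % re.escape(base) for base in query)

def find_base_positions(query, target, limit=None):
    """Return [(base, start), ...] for the first match, or None."""
    m = re.search(make_gap_pattern(query, limit), target)
    if m is None:
        return None
    return [(m.group(i), m.start(i)) for i in range(1, len(query) + 1)]

print(find_base_positions("atcg", "tcgaacccgtaaaaagctaatttttttg", 10))
# [('a', 3), ('t', 9), ('c', 16), ('g', 27)]
print(find_base_positions("atcg", "tcgaacccgtaaaaagctaattttttttg", 10))
# None
```

Note that re.search only reports the leftmost match, so this gives
one set of positions per target, not every possible alignment.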



				Andrew
				dalke at dalkescientific.com



