[Tutor] Finding all locations of a sequence

Fri Jun 15 01:08:27 CEST 2007

"Lauren" <laurenb01 at gmail.com> wrote

Caveat: I am not into the realms of DNA sequencing so this may
not be viable but...

> Say I have chicken and I want to know where it occurs in a string of
> words, but I want it to match to both chicken and poultry and have
> the output of:
>
> chicken  (locations of chicken and poultry in the string)

When searching for more than one pattern at a time I'd go
for a regex. A simple string search is faster on its own but
a single regex search will typically be faster than a repeated
string search.

For the simple case above a search for  (chicken)|(poultry)
should work:

>>> import re
>>> s = ''' there are a lot of chickens in my poultry farm but
...     very few could be called a spring chicken'''
>>> regex = '(chicken)|(poultry)'
>>> r = re.compile(regex)
>>> r.findall(s)
[('chicken', ''), ('', 'poultry'), ('chicken', '')]
>>> [match for match in r.finditer(s)]
[<_sre.SRE_Match object at 0x01E75920>,
 <_sre.SRE_Match object at 0x01E758D8>,
 <_sre.SRE_Match object at 0x01E75968>]
>>>

The match objects will let you find the location in the original
string (via their start(), end() and span() methods), which I
suspect you will need.
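
As a rough sketch of how the locations could be collected under a
single heading: the function name find_locations and the choice of
"chicken" as the label are my own illustrative assumptions, not
anything from the original question. It relies only on
re.finditer() and the start() method of each match object.

```python
import re

def find_locations(text, words):
    """Return the start offset of every occurrence of any word in `words`.

    Hypothetical helper for illustration only.
    """
    # re.escape guards against words that contain regex metacharacters
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return [m.start() for m in pattern.finditer(text)]

s = ("there are a lot of chickens in my poultry farm but "
     "very few could be called a spring chicken")

# Report hits for both words under the one label
locations = find_locations(s, ["chicken", "poultry"])
print("chicken %s" % locations)
```

Note that finditer() only reports non-overlapping matches, which may
or may not be what you want for sequence data.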

> The string I'm dealing with is really large, so whatever will get
> through it the fastest is ideal for me.

Again I expect a regex to be fastest for multiple search criteria
over a single pass. Now what your regex will look like for
R/DNA sequences I have no idea, but if you can describe it I'm
sure somebody here can help formulate a suitable pattern.
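
Since speed matters, it is probably worth measuring on your own data
rather than trusting my expectation. Below is a minimal sketch using
the standard timeit module. The names regex_pass and repeated_find
are made up for illustration, and the test text is just a toy string
repeated many times, so the timings it prints say nothing definite
about real sequence data.

```python
import re
import timeit

WORDS = ["chicken", "poultry"]
PATTERN = re.compile("|".join(re.escape(w) for w in WORDS))

def regex_pass(text):
    """One pass over the text with a single alternation regex."""
    return [m.start() for m in PATTERN.finditer(text)]

def repeated_find(text):
    """One str.find scan per word, merged into sorted order."""
    hits = []
    for word in WORDS:
        pos = text.find(word)
        while pos != -1:
            hits.append(pos)
            # Resume after this hit, i.e. non-overlapping, like finditer
            pos = text.find(word, pos + len(word))
    return sorted(hits)

# Toy data only; substitute a sample of the real string
text = "a lot of chickens in my poultry farm, few spring chicken " * 10000

# Both approaches should agree before their timings are compared
assert regex_pass(text) == repeated_find(text)

for func in (regex_pass, repeated_find):
    seconds = timeit.timeit(lambda: func(text), number=5)
    print("%s: %.3f s" % (func.__name__, seconds))
```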

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld.