[Tutor] Difflib comparing string sequences

Ricardo Aráoz ricaraoz at gmail.com
Wed Mar 10 23:57:23 CET 2010


Vincent Davis wrote:
> I have never used difflib or anything similar and have a few questions.
> I am working with DNA sequences of length 25. I have a list of 230,000
> and need to look for each sequence in the entire genome (toxoplasma
> parasite). I am not sure how large the genome is, but it is more than
> 230,000 sequences.
> There are programs that do this really fast, and they even do partial
> matches, but not quite what I need. So I am looking to build a custom
> solution.
> I need to look for each of my sequences of 25 characters
> example(AGCCTCCCATGATTGAACAGATCAT).
> The genome is formatted as a continuous string
> (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)
>
> I don't care where or how many times, only whether it exists. This is
> simple, I think: str.find(AGCCTCCCATGATTGAACAGATCAT)
>
> But I also want to find a close match, defined as wrong at only 1
> location, and I want to record the location. I am not sure how to do
> this. The only thing I can think of is using a wildcard and performing
> the search with a wildcard in each position, i.e. 25 times.
> For example
> AGCCTCCCATGATTGAACAGATCAT
> AGCCTCCCATGATAGAACAGATCAT
> close match with a mismatch at position 13

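Untested:

import fnmatch

genome = 'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'
sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'

# Put a single-character wildcard ('?') at each position in turn and
# check whether the resulting pattern occurs anywhere in the genome.
for i in range(len(sequence)):
    match = '*' + sequence[:i] + '?' + sequence[i+1:] + '*'
    if fnmatch.fnmatch(genome, match):
        print('It matches with a wildcard at position %d' % i)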
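
The wildcard loop only says whether a match exists, not where in the genome it sits. As a rough alternative (not the approach above, and not tuned for speed), one could slide a window along the genome and count mismatches directly, which records both the offset in the genome and the mismatched position in the sequence. The function name find_near_matches and the max_mismatches parameter are made up for this sketch.

```python
def find_near_matches(genome, sequence, max_mismatches=1):
    """Return a list of (offset, mismatch_positions) tuples, one for every
    window of the genome that differs from sequence in at most
    max_mismatches positions. An exact match has an empty position list."""
    hits = []
    n = len(sequence)
    for start in range(len(genome) - n + 1):
        mismatches = []
        for j in range(n):
            if genome[start + j] != sequence[j]:
                mismatches.append(j)
                # Give up on this window as soon as it has too many errors.
                if len(mismatches) > max_mismatches:
                    break
        else:
            # The inner loop finished without break, so the window is close enough.
            hits.append((start, mismatches))
    return hits


if __name__ == '__main__':
    # Toy data only; a real genome would be read from a file.
    print(find_near_matches('AAACTTAAA', 'CGT'))
```

This does on the order of len(genome) * len(sequence) comparisons per sequence, so in pure Python it is likely to be slow for 230,000 sequences against a whole genome. It is meant to show the logic rather than to be a practical solution.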


