[Tutor] need help with comparing list of sequences in Python!!

Tue Aug 31 13:04:09 CEST 2004

Fuzzi,

Here is one way to do this:
- Use zip() to pair up elements from the two sequences
 >>> s1='aaabbbbcccc'
 >>> s2='aaaccccbcccccccccc'
 >>> zip(s1, s2)
[('a', 'a'), ('a', 'a'), ('a', 'a'), ('b', 'c'), ('b', 'c'), ('b', 'c'), 
('b', 'c'), ('c', 'b'), ('c', 'c'), ('c', 'c'), ('c', 'c')]

- Use a list comprehension to compare the elements of the pair and put the 
results in a new list. I'm not sure if you want to count the matches or the 
mismatches - your original post says mismatches, but in your example you 
count matches. This example counts matches but it is easy to change.
 >>> [a == b for a, b in zip(s1, s2)]
[True, True, True, False, False, False, False, False, True, True, True]

- In Python, True has a value of 1 and False has a value of 0, so adding up 
the elements of this list gives the number of matches:
 >>> sum([a == b for a, b in zip(s1, s2)])
6

- min() and len() give you the length of the shorter of the two sequences:
 >>> min(len(s1), len(s2))
11

- When you divide, you have to convert one of the numbers to a float or 
Python will use integer division!
 >>> 6/11
0
 >>> float(6)/11
0.54545454545454541
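
The steps above can be collected into a pair of small functions. This is 
only a sketch: the names match_fraction and mismatch_distance are made up 
for illustration, and I am assuming that empty sequences should be treated 
as an error. Calling float() before dividing gives true division whichever 
Python version you run.

```python
def match_fraction(s1, s2):
    """Fraction of positions that match, over the length of the shorter sequence."""
    shortest = min(len(s1), len(s2))
    if shortest == 0:
        # Assumption: an empty sequence is an error, not a score of 0
        raise ValueError("both sequences must be non-empty")
    # zip() stops at the end of the shorter sequence
    matches = sum([a == b for a, b in zip(s1, s2)])
    return float(matches) / shortest


def mismatch_distance(s1, s2):
    """Fraction of positions that differ, over the length of the shorter sequence."""
    shortest = min(len(s1), len(s2))
    if shortest == 0:
        raise ValueError("both sequences must be non-empty")
    mismatches = sum([a != b for a, b in zip(s1, s2)])
    return float(mismatches) / shortest


s1 = 'aaabbbbcccc'
s2 = 'aaaccccbcccccccccc'
print(match_fraction(s1, s2))     # 6 matches out of 11 positions
print(mismatch_distance(s1, s2))  # 5 mismatches out of 11 positions
```

Use match_fraction() if you want the similarity from your example, or 
mismatch_distance() if you want the distance as your first post defined it.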

Put this together with the framework that Alan gave you to create a program 
that calculates distances. Then you can start on the randomization part.
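
For the randomization part, here is one possible starting point. It is a 
rough sketch under several assumptions of mine, not a finished program: the 
names mutate, distance and AMINO_ACIDS are hypothetical, the sequences are 
short made-up dummy data, and I am assuming that a "change" always replaces 
a letter with a different one (so P = 1.0 changes every position) and that 
it is the copy of each sequence, never the reference, that gets changed.

```python
import random

# The 20 letters listed in the original post
AMINO_ACIDS = 'ARNDCEQGHILKMFPSTWYV'


def mutate(seq, p):
    """Return a copy of seq where each letter is replaced with probability p."""
    result = []
    for letter in seq:
        # random.random() is in [0.0, 1.0), so p=0 never changes a letter
        # and p=1.0 always does
        if random.random() < p:
            # Assumption: pick a *different* letter, so a change is always visible
            choices = [c for c in AMINO_ACIDS if c != letter]
            letter = random.choice(choices)
        result.append(letter)
    return ''.join(result)


def distance(s1, s2):
    """Mismatches divided by the length of the shorter sequence."""
    shortest = min(len(s1), len(s2))
    mismatches = sum([a != b for a, b in zip(s1, s2)])
    return float(mismatches) / shortest


# Made-up dummy data; the first sequence is the reference
sequences = ['MVEIGEKAPE', 'MAPITVGDVVPDG', 'APIKVGDA']
reference = sequences[0]

# P values 0.0, 0.1, ... 1.0
for step in range(11):
    p = step / 10.0
    for seq in sequences:
        # The reference itself is never altered; only the copy is mutated
        changed = mutate(seq, p)
        print(p, seq, distance(reference, changed))
```

Because the changes are random, the distances will differ from run to run 
(except at P = 0), so you may want to repeat each P value several times and 
average the results.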

Kent

At 04:03 AM 8/31/2004 +0100, Fathima Javeed wrote:
>Hi Kent
>
>To answer your question:
>well here is how it works
>sequence one = aaabbbbcccc
>length = 11
>
>seq 2 = aaaccccbcccccccccc
>length = 18
>
>to get the pairwise similarity score the program compares the letters
>of the two sequences up to length = 11, the length of the shorter sequence.
>
>so a match gets a score of 1, therefore using + for match and x for mismatch
>
>aaabbbbcccc
>aaaccccbcccccccccc
>+++xxxxx+++
>
>therefore the score = 6/11 = 0.5454 or 54%
>
>so you only score the first 11 letters of each sequence and it is not
>required to compare the rest of sequence 2. this is what the
>distance matrix is doing
>
>match score == 6
>
>The spaces are deleted to make both of them the same length
>
>
>>From: Kent Johnson <kent_johnson at skillsoft.com>
>>To: "Fathima Javeed" <fathimajaveed at hotmail.com>, tutor at python.org
>>Subject: Re: [Tutor] need help with comparing list of sequences in
>>Python!!
>>Date: Mon, 30 Aug 2004 13:53:19 -0400
>>
>>Fuzzi,
>>
>>How do you count mismatches if the lengths of the sequences are 
>>different? Do you start from the front of both sequences or do you look 
>>for a best match? Do you count the extra characters in the longer string 
>>as mismatches or do you ignore them? An example or two would help.
>>
>>For example if
>>s1=ABCD
>>s2=XABDDYY
>>how many characters do you count as different?
>>
>>Kent
>>
>>At 07:00 PM 8/29/2004 +0100, Fathima Javeed wrote:
>>>Hi,
>>>would really appreciate it if someone could help me in Python as i am 
>>>new to the language.
>>>
>>>Well i have a list of protein sequences in a text file, e.g. (dummy data)
>>>
>>>MVEIGEKAPEIELVDTDLKKVKIPSDFKGKVVVLAFYPAAFTSVCTKEMCTFRDSMAKFNEVNAVVIGISVDP
>>>PFS
>>>
>>>MAPITVGDVVPDGTISFFDENDQLQTVSVHSIAAGKKVILFGVPGAFTPTCSMSHVPGFIGKAEELKSKG
>>>
>>>APIKVGDAIPAVEVFEGEPGNKVNLAELFKGKKGVLFGVPGAFTPGCSKTHLPGFVEQAEALKAKGVQVVACL
>>>SVND
>>>
>>>HGFRFKLVSDEKGEIGMKYGVVRGEGSNLAAERVTFIIDREGNIRAILRNI
>>>
>>>etc etc
>>>
>>>They are not always of the same length,
>>>
>>>The first sequence is always the reference sequence which I am trying to 
>>>investigate. Basically, to reach the objective, I need to compare each 
>>>sequence with the first one, starting with the comparison of the 
>>>reference sequence with itself.
>>>
>>>The objective of the program is to manipulate each sequence, i.e. 
>>>randomly change characters, and calculate the distance (Distance: number 
>>>of letters between a pair of sequences that don't match, DIVIDED by the 
>>>length of the shortest sequence) between the sequence in question 
>>>and the reference sequence. So I need program code that 
>>>takes the first sequence as the reference sequence (a constant which is 
>>>at the top of the list); first it compares it with itself, then it compares 
>>>it with the second sequence, then with the third sequence, etc., one at a time.
>>>
>>>For the first comparison, you take a copy of the ref sequence and 
>>>manipulate the copied sequence, i.e. randomly change the letters in 
>>>the sequence, and calculate the distance between them.
>>>(the letters that are used for this are: A R N D C E Q G H I L K M F P S 
>>>T W Y V)
>>>
>>>The reference sequence is never altered or manipulated; for the first 
>>>comparison, it's the copied version of the reference sequence that's altered.
>>>
>>>Randomization is done using different P values
>>>(P = probability of change), for example:
>>>if P = 0      no random change has been made
>>>if P = 1.0   all the letters in that particular sequence have been 
>>>randomly changed, so P = 1.0 corresponds to changing as many letters as 
>>>the length of the sequence
>>>
>>>So it's calculating the distance each time between two sequences (the first 
>>>is always the reference sequence, plus a second sequence) at each P 
>>>value (starting from 0, then 0.1, 0.2, ....... 1.0).
>>>
>>>Note: Number of sequences to be compared could be any number and of any 
>>>length
>>>
>>>I don't know how to compare each sequence with the first sequence, or how 
>>>to randomize the characters in a sequence so that I can 
>>>calculate the distance for each pair of sequences. If someone can give me 
>>>any guidance, I would be grateful.
>>>
>>>Cheers
>>>Fuzzi