[Tutor] need help with comparing list of sequences in Python!!

Fathima Javeed fathimajaveed at hotmail.com
Tue Aug 31 05:03:59 CEST 2004


Hi Kent

To answer your question, here is how it works:
sequence 1 = aaabbbbcccc
length = 11

sequence 2 = aaaccccbcccccccccc
length = 18

To get the pairwise similarity score, the program compares the letters
of the two sequences up to position 11, the length of the shorter sequence.

A match gets a score of 1. Using + for a match and x for a mismatch:

aaabbbbcccc
aaaccccbcccccccccc
+++xxxxx+++

Therefore the score = 6/11 = 0.5454, or about 54.5%.

So you only score the first 11 letters of each sequence, and it is not
required to compare the rest of sequence 2. This is what the
distance matrix is doing.

match score == 6

The spaces are deleted to make both of them the same length
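
A minimal sketch of this scoring in Python 3, assuming the comparison always starts at the front of both sequences and simply ignores the extra letters of the longer one. The function name similarity is made up for illustration:

```python
def similarity(seq1, seq2):
    """Return (matches, score) over the first n letters, n = shorter length."""
    n = min(len(seq1), len(seq2))
    if n == 0:
        raise ValueError("cannot compare an empty sequence")
    # zip() stops at the end of the shorter sequence, so the extra
    # letters of the longer sequence are never looked at.
    matches = sum(1 for a, b in zip(seq1, seq2) if a == b)
    return matches, matches / n


matches, score = similarity("aaabbbbcccc", "aaaccccbcccccccccc")
print(matches, round(score, 4))  # 6 0.5455

# Kent's example: only D/D in position 4 matches.
print(similarity("ABCD", "XABDDYY"))  # (1, 0.25)
```

Under this rule Kent's example (s1=ABCD, s2=XABDDYY) counts 3 of the first 4 characters as different.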


>From: Kent Johnson <kent_johnson at skillsoft.com>
>To: "Fathima Javeed" <fathimajaveed at hotmail.com>, tutor at python.org
>Subject: Re: [Tutor] need help with comparing list of sequences in  
>Python!!
>Date: Mon, 30 Aug 2004 13:53:19 -0400
>
>Fuzzi,
>
>How do you count mismatches if the lengths of the sequences are different? 
>Do you start from the front of both sequences or do you look for a best 
>match? Do you count the extra characters in the longer string as mismatches 
>or do you ignore them? An example or two would help.
>
>For example if
>s1=ABCD
>s2=XABDDYY
>how many characters do you count as different?
>
>Kent
>
>At 07:00 PM 8/29/2004 +0100, Fathima Javeed wrote:
>>Hi,
>>I would really appreciate it if someone could help me in Python as I am new 
>>to the language.
>>
>>Well, I have a list of protein sequences in a text file, e.g. (dummy data)
>>
>>MVEIGEKAPEIELVDTDLKKVKIPSDFKGKVVVLAFYPAAFTSVCTKEMCTFRDSMAKFNEVNAVVIGISVDP
>>PFS
>>
>>MAPITVGDVVPDGTISFFDENDQLQTVSVHSIAAGKKVILFGVPGAFTPTCSMSHVPGFIGKAEELKSKG
>>
>>APIKVGDAIPAVEVFEGEPGNKVNLAELFKGKKGVLFGVPGAFTPGCSKTHLPGFVEQAEALKAKGVQVVACL
>>SVND
>>
>>HGFRFKLVSDEKGEIGMKYGVVRGEGSNLAAERVTFIIDREGNIRAILRNI
>>
>>etc etc
>>
>>They are not always of the same length,
>>
>>The first sequence is always the reference sequence which I am trying to 
>>investigate. To reach the objective, I need to compare each sequence with 
>>the first one, starting with the comparison of the reference sequence 
>>with itself.
>>
>>The objective of the program is to manipulate each sequence, i.e. randomly 
>>change characters, and calculate the distance (Distance: number of letters 
>>between a pair of sequences that don't match DIVIDED by the length of the 
>>shortest sequence) between the sequence in question and the reference 
>>sequence. So I need program code that takes the first sequence as a 
>>reference sequence (a constant, which is on top of the list); first it 
>>compares it with itself, then with the second sequence, then with the 
>>third sequence, etc., one at a time.
>>
>>For the first comparison, you take a copy of the reference sequence and 
>>manipulate the copied sequence, i.e. randomly change the letters in the 
>>sequence, and calculate the distance between them.
>>(The letters that are used for this are: A R N D C E Q G H I L K M F P S T 
>>W Y V)
>>
>>The reference sequence is never altered or manipulated; for the first 
>>comparison, it's the copied version of the reference sequence that's 
>>altered.
>>
>>Randomization is done using different P values
>>(P = probability of change), for example:
>>if P = 0      no random change has been made
>>if P = 1.0   all the letters in that particular sequence have been randomly 
>>changed, so P = 1.0 corresponds to the full length of the sequence
>>
>>So it's calculating the distance each time between two sequences (the first 
>>is always the reference sequence, the second is another sequence) at each 
>>P value (starting from 0, then 0.1, 0.2, ....... 1.0).
>>
>>Note: the number of sequences to be compared could be any number, and they 
>>could be of any length.
>>
>>I don't know how to compare each sequence with the first sequence, or how 
>>to randomize the characters in a sequence and then calculate the distance 
>>for each pair of sequences. If someone can give me any guidance, I would 
>>be grateful.
>>
>>Cheers
>>Fuzzi
>>
>
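
For the randomization step described in the original question quoted above, here is a rough Python 3 sketch. It rests on one reading of "P" that the thread does not confirm: round(P * length) positions are picked at random and each is replaced by a different letter, so P = 0 changes nothing and P = 1.0 changes every letter. The names distance, mutate and AMINO_ACIDS are made up for illustration.

```python
import random

# The 20 letters listed in the original question.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def distance(ref, other):
    """Mismatches in the first n letters divided by n (n = shorter length)."""
    n = min(len(ref), len(other))
    if n == 0:
        raise ValueError("cannot compare an empty sequence")
    mismatches = sum(1 for a, b in zip(ref, other) if a != b)
    return mismatches / n


def mutate(seq, p, rng=random):
    """Return a copy of seq with round(p * len(seq)) letters changed.

    Assumption: each chosen position gets a *different* letter, so the
    number of changed positions is exact rather than approximate.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    letters = list(seq)
    k = round(p * len(letters))
    for i in rng.sample(range(len(letters)), k):
        letters[i] = rng.choice([c for c in AMINO_ACIDS if c != letters[i]])
    return "".join(letters)


# One of the dummy sequences from the question, used as the reference.
reference = "HGFRFKLVSDEKGEIGMKYGVVRGEGSNLAAERVTFIIDREGNIRAILRNI"

# The reference itself is never modified; mutate() works on a copy.
for step in range(11):
    p = step / 10
    print(p, distance(reference, mutate(reference, p)))
```

If P is instead meant as an independent per-letter probability, the loop in mutate would test random.random() < p at each position, and the observed distance would only approximate P.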
