[Tutor] Words alignment tool

Kent Johnson kent37 at tds.net
Mon Dec 5 04:17:21 CET 2005


Srinivas Iyyer wrote:

>Dear Expert programmers, 
>
>I apologise if this mail is out of context here. 
>
>I have a list of elements like these:
>
>Contr1	SPR-10	SPR-101	SPR-125	SPR-137	SPR-139	SPR-143
>contr2	SPR-1	SPR-15  SPR-126	SPR-128	SPR-141	SPR-148	
>contr3	SPR-106	SPR-130	SPR-135	SPR-138	SPR-139	SPR-145
>contr4	SPR-124	SPR-125	SPR-130	SPR-139	SPR-144	SPR-148
>
>
>There are several common elements prefixed with SPR-. 
>Although these elements are sorted in ascending order
>row wise, the common elements are difficult to spot. 
>One has to look for common elements by eyeballing.  
>It would be wonderful if these elements are aligned
>properly by inserting gaps.
>  
>
I think this is much easier than the bioinformatics problem because your 
sequence elements are unique and sorted, and you don't have very much data.

One approach is to create pairs that look like ('SPR-10', 'Contr1') for 
all the data. These pairs can be put into one big list and sorted, then 
grouped by the first element to get what you want. Python 2.4 has the 
itertools.groupby() function, which makes it easy to do the grouping. For example:

data = '''Contr1    SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2  SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148
contr3  SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4  SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148'''.splitlines()

import itertools, operator

pairs = [] # This will be a list of all the pairs like ('SPR-10', 'Contr1')

for line in data:
    items = line.split()
    name, items = items[0], items[1:]
    # now name is the first item on the line, items is a list of all the rest
    # add the pairs for this line to the main list
    pairs.extend( (item, name) for item in items)

pairs.sort()   # Sort the list to bring the first items together

# groupby() will return a sequence of key, group pairs where the key is the
# first element of the group
for k, g in itertools.groupby(pairs, operator.itemgetter(0)):
    print k, [ name for item, name in g ]


The output of this program is
SPR-1 ['contr2']
SPR-10 ['Contr1']
SPR-101 ['Contr1']
SPR-106 ['contr3']
SPR-124 ['contr4']
SPR-125 ['Contr1', 'contr4']
SPR-126 ['contr2']
SPR-128 ['contr2']
SPR-130 ['contr3', 'contr4']
SPR-135 ['contr3']
SPR-137 ['Contr1']
SPR-138 ['contr3']
SPR-139 ['Contr1', 'contr3', 'contr4']
SPR-141 ['contr2']
SPR-143 ['Contr1']
SPR-144 ['contr4']
SPR-145 ['contr3']
SPR-148 ['contr2', 'contr4']
SPR-15 ['contr2']
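
Note that SPR-15 comes out last because the pairs are sorted as strings, 
so '148' sorts before '15'. If you want numeric order, one option is to 
sort with a key function. Here is a rough sketch of that, written for 
Python 3 (print is a function there). The sort_key() helper is my own 
name, not something from the library, and it assumes every item looks 
like PREFIX-NUMBER:

```python
import itertools
import operator

data = '''Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148'''.splitlines()


def sort_key(pair):
    # Hypothetical helper: turn ('SPR-15', 'contr2') into ('SPR', 15, 'contr2')
    # so the numeric part is compared as an integer, not as text.
    item, name = pair
    prefix, number = item.split('-')
    return prefix, int(number), name


pairs = []
for line in data:
    fields = line.split()
    name, items = fields[0], fields[1:]
    pairs.extend((item, name) for item in items)

pairs.sort(key=sort_key)  # equal items are still adjacent after this sort

# Collect (item, [names]) for each distinct item, in numeric order
grouped = [(k, [name for item, name in g])
           for k, g in itertools.groupby(pairs, operator.itemgetter(0))]

for k, names in grouped:
    print(k, names)
```

With this key SPR-15 lands between SPR-10 and SPR-101 instead of at the 
end.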

Converting this to a horizontal display is still a little tricky but 
I'll leave that for you.
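
If you get stuck, here is one possible way to do the horizontal display, 
as a sketch only (Python 3 syntax). The idea is to build one column per 
distinct item and print either the item or a gap marker in each row. The 
'-' gap marker, the number() helper and the column widths are my own 
choices for illustration:

```python
data = '''Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148
contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148'''.splitlines()

# Keep the rows in their original order, with each row's items as a set
rows = []
for line in data:
    fields = line.split()
    rows.append((fields[0], set(fields[1:])))


def number(item):
    # Hypothetical helper: 'SPR-15' -> 15, assuming a PREFIX-NUMBER format
    return int(item.split('-')[1])


# One column for every distinct item, in numeric order
columns = sorted(set().union(*(items for name, items in rows)), key=number)

width = max(len(c) for c in columns) + 1  # widest item plus one space
aligned = []
for name, items in rows:
    # Show the item if this row has it, otherwise the gap marker
    cells = [(c if c in items else '-').ljust(width) for c in columns]
    aligned.append(name.ljust(8) + ''.join(cells).rstrip())

for line in aligned:
    print(line)
```

Every row ends up with the same number of fields, so a common element 
such as SPR-139 lines up in the same column wherever it occurs.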

I should probably explain more about groupby() and itemgetter() but not 
tonight...
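
In the meantime, a tiny sketch (Python 3 syntax) that may help. 
operator.itemgetter(0) returns a function that fetches element 0 of 
whatever you pass it, and itertools.groupby() only groups items that are 
next to each other, which is why the list has to be sorted first. The 
groups() helper and the sample pairs are made up for the illustration:

```python
import itertools
import operator

# itemgetter(0) builds a function equivalent to: lambda x: x[0]
first = operator.itemgetter(0)
print(first(('SPR-10', 'Contr1')))

pairs = [('a', 1), ('a', 2), ('b', 3), ('a', 4)]


def groups(seq):
    # Hypothetical helper: list each key with the second elements in its group
    return [(k, [v for _, v in g]) for k, g in itertools.groupby(seq, first)]


# Unsorted: the last ('a', 4) is not adjacent to the other 'a' pairs,
# so groupby() starts a new group for it.
unsorted_groups = groups(pairs)
print(unsorted_groups)

# Sorted: all the 'a' pairs are adjacent and end up in one group.
sorted_groups = groups(sorted(pairs))
print(sorted_groups)
```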

Kent


