finding homopolymers in both directions

Tue Aug 3 14:31:35 EDT 2010

Lee Sander wrote:

> Hi,
> Suppose I have a string such as this
> 'aabccccccefggggghiiijkr'
> 
> I would like to print out all the positions that are flanked by a run
> of symbols.
> So for example, I would like the output for the above input to be as
> follows:
> 
> 2  b  1 aa
> 2  b  -1 cccccc
> 10  e  -1 cccccc
> 11  f  1 ggggg
> 17 h  1 iii
> 17 h -1 ggggg
> 
> where the first column is the position of interest, the next column is
> the entry at that position,
> 1 if the following column refers to a run that comes after and -1 if
> the run comes before

Trying to follow your spec I came up with

from itertools import groupby
from collections import namedtuple

Item = namedtuple("Item", "pos key size")

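def compact(seq):
    # Collapse the sequence into one Item per run of equal symbols;
    # pos is the 0-based index where the run starts.
    pos = 0
    for key, group in groupby(seq):
        size = len(list(group))
        yield Item(pos, key, size)
        pos += size

def window(items):
    # Yield (previous, current, next) triples; the missing neighbour
    # at either end of the sequence is None.
    items = iter(items)
    prev = None
    cur = next(items)
    for nxt in items:
        yield prev, cur, nxt
        prev = cur
        cur = nxt
    yield prev, cur, None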

items = compact("aabccccccefggggghiiijkr")

for prev, cur, nxt in window(items):
    if cur.size == 1:
        if prev is not None:
            if prev.size > 1:
                print cur.pos, cur.key, -1, prev.key*prev.size
        if nxt is not None:
            if nxt.size > 1:
                print cur.pos, cur.key, 1, nxt.key*nxt.size

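However, this gives a slightly different output: positions are counted
from 0, the sign follows your description (-1 for a run before, 1 for a
run after) rather than your example, and 'j' is reported as well: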

$ python homopolymers.py
2 b -1 aa
2 b 1 cccccc
9 e -1 cccccc
10 f 1 ggggg
16 h -1 ggggg
16 h 1 iii
20 j -1 iii
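
The script above uses the Python 2 print statement. For Python 3, the
same idea might be sketched as a function that collects the rows in a
list before printing them. The name flanking_runs is only illustrative
and not part of the original script; this is a minimal sketch, not a
drop-in replacement.

```python
from itertools import groupby


def flanking_runs(seq):
    """Return (pos, symbol, direction, run) rows for single symbols next to a run.

    Hypothetical helper: positions are 0-based, direction is -1 for a run
    before the symbol and 1 for a run after it.
    """
    # Collapse the sequence into (start, symbol, length) triples.
    runs = []
    pos = 0
    for key, group in groupby(seq):
        size = len(list(group))
        runs.append((pos, key, size))
        pos += size

    rows = []
    for i, (start, key, size) in enumerate(runs):
        if size != 1:
            continue
        # Run immediately before the single symbol.
        if i > 0 and runs[i - 1][2] > 1:
            _, pkey, psize = runs[i - 1]
            rows.append((start, key, -1, pkey * psize))
        # Run immediately after the single symbol.
        if i + 1 < len(runs) and runs[i + 1][2] > 1:
            _, nkey, nsize = runs[i + 1]
            rows.append((start, key, 1, nkey * nsize))
    return rows


for row in flanking_runs("aabccccccefggggghiiijkr"):
    print(*row)
```

Returning the rows instead of printing inside the loop makes it easy to
filter or re-sort them, for example to shift the positions to 1-based.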

Peter