Need a specific sort of string modification. Can someone help?

Roy Smith roy at panix.com
Sat Jan 5 09:12:31 EST 2013


In article <e480480d-f3b4-4491-969c-7d1843bf9e33 at googlegroups.com>,
 Sia <hossein.asgharian at gmail.com> wrote:

> I have strings such as:
> 
> tA.-2AG.-2AG,-2ag
> or
> .+3ACG.+5CAACG.+3ACG.+3ACG

Some kind of DNA binding site?

A couple of questions.  Are the numbers always single digits?  How much 
data is there?  Are we talking a few hundred 20-character strings, or 
all of Genbank?

> The plus and minus signs are always followed by a number (say, i). I want 
> python to find each single plus or minus, remove the sign, the number after 
> it and remove i characters after that. So the two strings above become:
> 
> tA..,
> and
> ...

If I follow your description properly, the last output should be "...." 
(4 dots), right?  The code below looks like it should work.  I'm sure 
there are more efficient ways to do it, but for small inputs, this 
should be OK.

The general pattern here is a state machine.  It's a good pattern to 
learn if you're going to be doing any kind of sequence analysis.  See, 
for example, http://en.wikipedia.org/wiki/State_machine.

# Build up the new string as a list (for efficiency)                                                
new = []

# Keep track of what state we're in.  The three possible states                                     
# are 1) scanning for a region to be deleted, 2) looking for the                                    
# number, and 3) reading past the letters to be deleted.                                            
SCANNING = 1
NUMBER = 2
DROPPING = 3
state = SCANNING

# If we are in state DROPPING, dropcount is the number of                                           
# letters remaining to be dropped.                                                                  
dropcount = 0

old = '.+3ACG.+5CAACG.+3ACG.+3ACG'
for c in old:
    if state == SCANNING:
        if c in '+-':
            state = NUMBER
        else:
            new.append(c)

    elif state == NUMBER:
        # Assume the counts are all single digits.  If not, then
        # we'll need a 4th state for accumulating the digits.  
        dropcount = int(c)
        state = DROPPING

    else:
        assert state == DROPPING
        dropcount -= 1
        if dropcount == 0:
            state = SCANNING

# Parenthesized form works as written on both Python 2 and Python 3.
print(''.join(new))
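
If the counts can run to more than one digit, the single-digit assumption
in the NUMBER state breaks. What follows is a minimal sketch of one way to
lift it, not a tested production routine. It accumulates digits inside the
NUMBER state instead of adding a separate fourth state, and wraps the loop
in a function. The name strip_indels and the digits list are hypothetical,
chosen only for illustration. It assumes every sign is followed by at
least one digit and that the characters to be dropped are never digits
themselves (otherwise the end of the count is ambiguous).

```python
def strip_indels(old):
    """Remove each +N/-N marker and the N characters that follow it.

    Assumes every sign is followed by at least one digit, and that the
    dropped characters are never digits (hypothetical helper).
    """
    SCANNING, NUMBER, DROPPING = 1, 2, 3
    state = SCANNING
    digits = []      # digit characters of the count currently being read
    dropcount = 0    # characters still to drop while in state DROPPING
    new = []         # output built up as a list, joined at the end

    for c in old:
        if state == NUMBER:
            if c.isdigit():
                digits.append(c)
                continue
            # The first non-digit ends the count.  A count of zero
            # means there is nothing to drop, so resume scanning.
            dropcount = int(''.join(digits))
            digits = []
            state = DROPPING if dropcount > 0 else SCANNING
            # Fall through: c still has to be handled in the new state.

        if state == DROPPING:
            dropcount -= 1
            if dropcount == 0:
                state = SCANNING
        elif c in '+-':
            # state is SCANNING and a new marker starts here
            state = NUMBER
        else:
            new.append(c)

    return ''.join(new)


print(strip_indels('tA.-2AG.-2AG,-2ag'))           # tA..,
print(strip_indels('.+3ACG.+5CAACG.+3ACG.+3ACG'))  # ....
```

Keeping the digit handling in NUMBER means the character that ends the
count is processed in the same pass through the loop, so nothing has to
be pushed back onto the input.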


