Need a specific sort of string modification. Can someone help?

Nick Mellor thebalancepro at gmail.com
Mon Jan 7 02:07:36 EST 2013


Hi Sia, 

Thanks for the problem! I hope you find these examples understandable.

Below, find an inflexible but fairly fast single-digit method and a slower (but still respectable) multi-digit method that copes with entirely absent digits after +/- and multi-digit skips such as 12 or 37 or 186 following letters rather than just 3 or 5 or 9. Not knowing your problem domain I'm not sure if these larger skips are likely or even possible, nor whether it's likely that there will be no digit.

Neither method flags the edge case where there are not enough letters to skip before the next +/-. They just swallow that chunk of the string entirely without error. Is that a problem?

Python 2.7 or later.

from string import maketrans
from itertools import takewhile

# function used by takewhile to detect digits
def is_digit(s): return s.isdigit()

class redux:

    def __init__(self):
       intab = '+-'
       outtab = '  '
       self.trantab = maketrans(intab, outtab)


    # simple-minded, fast, 1-digit algorithm
    def reduce_plusminus(self, s):
        list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
                    for r
                    in s.translate(self.trantab).split()]
        return ''.join(list_form)

    # multi-digit algorithm
    def reduce_plusminus_multi_digit(self, s):
        chunks = s.translate(self.trantab).split()
        digits = [list(takewhile(is_digit, r))
                   for r
                   in chunks]
        # zero case ('else 0') is for missing digit(s)
        numbers = [int(''.join(r)) if r else 0
                   for r
                    in digits]
        # how far to skip (number and skipped letters)
        skips = [len(dig) + num
                 for dig, num
                 in zip(digits, numbers)]
        return ''.join([sequence[skipover:]
                        for skipover, sequence
                        in zip(skips, chunks)])

if __name__ == "__main__":
    p = redux()
    print (p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG"))
    print (p.reduce_plusminus("tA.-5AG.-2AG,-2ag"))
    print ('multi-digit...')
    print (p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG"))
    print (p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG"))
    print (p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+ACG.+3ACG"))

    for n in range(100000): p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")
    
    
Note that the multi-line version above tolerates missing digits: if the number is missing after the '+/-' it doesn't skip any letters. 

Explanation:

String slicing is good in Python, but list comprehensions are often the most powerful and efficient way to transform data. If you've come from another language or you're a newb to programming, it's well worth getting grips with list comprehensions [a for b in lst].

First, +/- are converted to spaces. Splitting the string into a list, breaking at spaces, effectively swallows the +/- characters and leaves us with chunks of string from which to drop the specified number of letters. Having dropped the letters, we convert the transformed list back into a string.

The single-digit version just assumes that the number is always present and always single-digit. If you allow multi-digits be ready to skip those multiple digits, as well as skipping letters, adding another stage to the calculation. E.g. in: 

+3ACG. -> . 

you skip 1 + 3 characters, 1 for the digit, 3 for the following letters as specified by the digit 3. In: 

-11ACGACGACGACG. -> G. 

You skip 2 + 11 characters, 2 characters in "11" and 11 characters following. And incidentally in: 

+ACG. -> ACG. 

there's no digit, so you skip 0 digits + 0 letters. If a lack of digits is a data error then it would be easy to test for-- just look for an empty list in the 'digits' list.

For each chunk (stuff between +/-) I use takewhile to separate the zero or more initial digits from the following letters. itertools.takewhile is a nice tool for matching stuff until the first time a condition ceases to hold. In this case I'm matching digits up until the first non-digit.

If takewhile doesn't find any digits at the start of the sequence, it returns the empty list []. ''.join(list) swallows empty lists so takewhile and ''.join() do behind-the-scenes leg-work for me in detecting the case of absent digit(s).

For my money speed isn't bad for these two methods (again, I don't know your domain.) On my (ancient) hardware the 1-digit version does over 100,000/s for your longer string, the multi-digit about 35,000/s.

HTH,

Nick

On Saturday, 5 January 2013 19:35:26 UTC+11, Sia  wrote:
> I have strings such as:
> 
> 
> 
> tA.-2AG.-2AG,-2ag
> 
> or
> 
> .+3ACG.+5CAACG.+3ACG.+3ACG
> 
> 
> 
> The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
> 
> 
> 
> tA..,
> 
> and
> 
> ...
> 
> 
> 
> How can I do that?
> 
> Thanks.



More information about the Python-list mailing list