substitution

Anthra Norell anthra.norell at bluewin.ch
Mon Jan 18 10:59:57 EST 2010


Anthra Norell wrote:
> superpollo wrote:
>> hi.
>>
>> what is the most pythonic way to substitute substrings?
>>
>> eg: i want to apply:
>>
>> foo --> bar
>> baz --> quux
>> quuux --> foo
>>
>> so that:
>>
>> fooxxxbazyyyquuux --> barxxxquuxyyyfoo
>>
>> bye
>
So it goes. The more it matters, the sillier the mistakes. The method 
__init__ () was incomplete and __call__ () was missing. Sorry about 
that. Here is the whole thing again:


class Translator:
    r"""
        Will translate any number of targets, handling them correctly if some overlap.

        Making Translator
            T = Translator (definitions, [eat = 0])
            'definitions' is a sequence of pairs: ((target, substitute), (t2, s2), ...)
            'eat' says whether untargeted sections pass (translator) or are skipped (extractor).
                Makes a translator by default (eat = False)
                T.eat is an instance attribute that can be changed at any time.
            Definitions example:
                (('a','A'), ('b','B'), ('ab','ab'), ('abc','xyz'),   # ('ab','ab') see Tricks.
                 ('\x0c', 'page break'), ('\r\n','\n'), ('   ','\t'))
            Order doesn't matter.
        Running
            translation = T (source)

        Tricks
            Deletion:  ('target', '')
            Exception: (('\n',''), ('\n\n','\n\n'))       # Eat LF except paragraph breaks.
            Exception: (('\n', '\r\n'), ('\r\n','\r\n'))  # Unix to DOS, would leave DOS unchanged
            Translation cascade:
                # Rejoin text lines per paragraph Unix or DOS, inserting inter-word space if missing
                Mark_LF = Translator ((('\n','+LF+'),('\r\n','+LF+'),('\r\n\r\n','\r\n\r\n'),('\n\n','\n\n')))
                # Pick positively identifiable mark for Unix and DOS end of lines
                Single_Space_Mark = Translator (((' +LF+', ' '),('+LF+', ' '),('-+LF+', '')))
                no_lf_text = Single_Space_Mark (Mark_LF (text))
            Translation cascade:
                # Nesting calls
                reptiles = T_latin_english (T_german_latin (reptilien))

        Limitations
            1. The number of substitutions and the maximum size of input depend on the respective
                capabilities of the Python re module.
            2. Regular expressions will not work as such.

        Author:
            Frederic Rentsch (anthra.norell at bluewin.ch).
    """

    def __init__ (self, definitions, eat = 0):

        r'''
            definitions: a sequence of pairs of strings. ((target, substitute), (t, s), ...)
            eat: False (0) means translate: unaffected data passes unaltered.
                 True  (1) means extract:   unaffected data doesn't pass (gets eaten).
                 Extraction filters typically require substitutes to end with some separator,
                 else they fuse together. (E.g. ' ', '\t' or '\n')
            'eat' is an attribute that can be switched anytime.

        '''
        self.eat = eat
        self.compile_sequence_of_pairs (definitions)
       
   
    def compile_sequence_of_pairs (self, definitions):

        '''
            Argument 'definitions' is a sequence of pairs:
            (('target 1', 'substitute 1'), ('t2', 's2'), ...)
            Order doesn't matter.        

        '''
                   
        import re
        self.definitions = definitions
        targets, substitutes = zip (*definitions)
        # Escape the targets so they match literally, not as regular expressions.
        re_targets = [re.escape (item) for item in targets]
        # Reverse sort puts a longer target ahead of any shorter one that is its prefix.
        re_targets.sort (reverse = True)
        self.targets_set = set (targets)
        self.table = dict (definitions)
        regex_string = '|'.join (re_targets)
        self.regex = re.compile (regex_string, re.DOTALL)
           
   
    def __call__ (self, s):
        hits = self.regex.findall (s)
        nohits = self.regex.split (s)
        valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
        if valid_hits:
            substitutes = [self.table [item] for item in hits if item in valid_hits]
            if self.eat:
                return ''.join (substitutes)
            else:
                # split () yields one more piece than there are hits: interleave, then append the tail.
                zipped = zip (nohits, substitutes)
                return ''.join ([nohit + substitute for nohit, substitute in zipped]) + nohits [-1]
        else:
            if self.eat:
                return ''
            else:
                return s
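

For the example in the original question, a minimal self-contained sketch of the same idea (an alternation of escaped targets plus a lookup table) might look like the following. It is not the class above: make_translator is a hypothetical helper name, it sorts targets by length rather than in reverse lexical order, and it uses re.sub with a callback instead of findall/split.

```python
import re

def make_translator(definitions, eat=False):
    """Hypothetical helper: build a one-pass substitution function."""
    table = dict(definitions)
    # Longest targets first, so a longer target wins over a shorter one
    # that is its prefix (e.g. '\n\n' over '\n').
    targets = sorted(table, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(t) for t in targets), re.DOTALL)

    def translate(source):
        if eat:
            # Extractor: keep only the substitutes, drop everything else.
            return ''.join(table[hit] for hit in regex.findall(source))
        # Translator: replace each hit, pass the rest through unchanged.
        return regex.sub(lambda match: table[match.group(0)], source)

    return translate

definitions = (('foo', 'bar'), ('baz', 'quux'), ('quuux', 'foo'))

swap = make_translator(definitions)
print(swap('fooxxxbazyyyquuux'))       # barxxxquuxyyyfoo

extract = make_translator(definitions, eat=True)
print(extract('fooxxxbazyyyquuux'))    # barquuxfoo

# The identity trick from the docstring: eat single LFs, keep paragraph breaks.
join_lines = make_translator((('\n', ''), ('\n\n', '\n\n')))
print(repr(join_lines('a\nb\n\nc')))   # 'ab\n\nc'
```

Because every hit is found in a single pass over the source, the 'foo' produced from 'quuux' is never fed back in and turned into 'bar', which is what chained str.replace calls would do.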



