substitution

Fri Jan 22 02:29:10 EST 2010

Iain King wrote:
> On Jan 21, 2:18 pm, Wilbert Berendsen <wbs... at xs4all.nl> wrote:
>   
>> On Monday 18 January 2010, Adi wrote:
>>
>>     
>>> keys = [(len(key), key) for key in mapping.keys()]
>>> keys.sort(reverse=True)
>>> keys = [key for (_, key) in keys]
>>>       
>>> pattern = "(%s)" % "|".join(keys)
>>> repl = lambda x : mapping[x.group(1)]
>>> s = "fooxxxbazyyyquuux"
>>>       
>>> re.subn(pattern, repl, s)
>>>       
>> I managed to make it even shorter, using the key argument for sorted, not
>> putting the whole regexp inside parentheses and pre-compiling the regular
>> expression:
>>
>> import re
>>
>> mapping = {
>>         "foo" : "bar",
>>         "baz" : "quux",
>>         "quuux" : "foo"
>>
>> }
>>
>> # sort the keys, longest first, so 'aa' gets matched before 'a', because
>> # in Python regexps the first match (going from left to right) in a
>> # |-separated group is taken
>> keys = sorted(mapping.keys(), key=len, reverse=True)
>>
>> rx = re.compile("|".join(keys))
>> repl = lambda x: mapping[x.group()]
>> s = "fooxxxbazyyyquuux"
>> rx.sub(repl, s)
>>
>> One thing remaining: if the replacement keys could contain non-alphanumeric
>> characters, they should be escaped using re.escape:
>>
>> rx = re.compile("|".join(re.escape(key) for key in keys))
>>
>> Kind regards,
>> Wilbert Berendsen
>>
>> --http://www.wilbertberendsen.nl/
>> "You must be the change you wish to see in the world."
>>         -- Mahatma Gandhi
>>     
>
> Sorting it isn't the right solution: easier to hold the subs as tuple
> pairs and by doing so let the user specify order.  Think of the
> following subs:
>
> "fooxx" -> "baz"
> "oxxx" -> "bar"
>
> does the user want "bazxbazyyyquuux" or "fobarbazyyyquuux"?
>
> Iain
>   
There is no way to automate a user's choice. If he wants the second
result (oxxx->bar taking precedence), he has to add a third pattern:
fooxxx -> fobar. In general, the rules 'upstream over downstream' and
'long over short' make sense in practically all cases. For anything
beyond simple substitution runs whose effect is obvious, the result
needs to be checked for unintended hits. Here is an example from my SE
manual, which runs a (whimsical) text through a set of substitutions
with heavily overlapping targets:

 >>> substitutions = [['be', 'BE'], ['being', 'BEING'], ['been', 'BEEN'],
 ...                  ['bee', 'BEE'], ['belong', 'BELONG'], ['long', 'LONG'],
 ...                  ['longer', 'LONGER']]
 >>> # Translator: code further up in this thread, handling precedence
 >>> # by the two rules mentioned
 >>> T = Translator (substitutions)
 >>> text = ("There was a bee named Mabel belonging to hive nine longing "
 ...         "to be a beetle and thinking that being a bee was okay, but "
 ...         "she had been a bee long enough and wouldn't be one much "
 ...         "longer.")
 >>> print T (text)
There was a BEE named MaBEl BELONGing to hive nine LONGing to BE a 
BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE 
LONG enough and wouldn't BE one much LONGER.

All word-length substitutions resolve correctly. There are four
unintended translations, though: MaBEl, BELONGing, LONGing and BEEtle.
Adding the substitution Mabel->Mabel would prevent the first miss. The
others could be taken care of the same way, by mapping each affected
word (belonging, longing, beetle) to itself. With large substitution
sets and extensive data, this amounts to an iterative process of
running, checking and fixing, many times over. That just isn't
practical and may have to be abandoned when the substitution catalog
grows beyond reasonable bounds. The dependable runs are those whose
targets are predictably unique, such as long id numbers that cannot
possibly match anything but id numbers.
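
For anyone who does not have the Translator code from earlier in the
thread at hand, here is a minimal stand-in that follows the same two
rules and shows the map-a-word-to-itself fix. It is only a sketch built
on the standard re module, not the SE implementation; the names
make_translator and translate are made up for this example.

```python
import re

def make_translator(substitutions):
    """Return a function that applies all substitutions in one pass.

    Hypothetical stand-in for the Translator class mentioned in the
    thread; not the SE implementation.
    """
    mapping = dict(substitutions)
    # 'Long over short': the regex engine takes the first alternative
    # that matches at a position, so longer targets must come first.
    targets = sorted(mapping, key=len, reverse=True)
    rx = re.compile("|".join(re.escape(t) for t in targets))
    # 'Upstream over downstream' comes from sub() scanning left to
    # right and never revisiting text it has already consumed.
    return lambda text: rx.sub(lambda m: mapping[m.group()], text)

substitutions = [
    ("be", "BE"), ("being", "BEING"), ("been", "BEEN"), ("bee", "BEE"),
    ("belong", "BELONG"), ("long", "LONG"), ("longer", "LONGER"),
    # Guard entries: a word mapped to itself is consumed whole, so the
    # shorter targets inside it are never reached.
    ("Mabel", "Mabel"), ("belonging", "belonging"),
    ("longing", "longing"), ("beetle", "beetle"),
]

translate = make_translator(substitutions)

text = (
    "There was a bee named Mabel belonging to hive nine longing "
    "to be a beetle and thinking that being a bee was okay, but she had been "
    "a bee long enough and wouldn't be one much longer."
)
print(translate(text))
```

With the four guard entries in place, Mabel, belonging, longing and
beetle come through unchanged while the word-length targets are still
translated. The fooxx/oxxx case works the same way: with only those two
rules the upstream fooxx wins, and adding fooxxx -> fobar makes the
longer target win instead.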

Frederic