substitution

Mon Jan 18 14:24:24 EST 2010

> From: Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au>
> To: python-list at python.org
> Date: 18 Jan 2010 16:26:48 GMT
> Subject: Re: substitution
> On Mon, 18 Jan 2010 06:23:44 -0800, Iain King wrote:
>
>> On Jan 18, 2:17 pm, Adi Eyal <a... at digitaltrowel.com> wrote:
> [...]
>>> Using regular expressions the answer is short (and sweet)
>>>
>>> mapping = {
>>>         "foo" : "bar",
>>>         "baz" : "quux",
>>>         "quuux" : "foo"
>>>
>>> }
>>>
>>> pattern = "(%s)" % "|".join(mapping.keys())
>>> repl = lambda x : mapping.get(x.group(1), x.group(1))
>>> s = "fooxxxbazyyyquuux"
>>> re.subn(pattern, repl, s)
>>
>> Winner! :)
>
> What are the rules for being declared "Winner"? For the simple case
> given, calling s.replace three times is much faster: more than twice as
> fast.

:) - The solution above is short and easy to read and digest. It is
also easy to maintain - you simply need to add additional entries to
the mapping dict if you need to add additional strings.

Calling replace 3 times in a row doesn't work correctly
e.g.

foo => bar
bar => baz

foo.replace("foo", "bar").replace("bar", "baz")

>
> But a bigger problem is that the above "winner" may not work correctly if
> there are conflicts between the target strings (e.g. 'a'->'X',
> 'aa'->'Y'). The problem is that the result you get depends on the order
> of the searches, BUT as given, that order is non-deterministic.
> dict.keys() returns in an arbitrary order, which means the caller can't
> specify the order except by accident. For example:
>
>>>> repl = lambda x : m[x.group(1)]
>>>> m = {'aa': 'Y', 'a': 'X'}
>>>> pattern = "(%s)" % "|".join(m.keys())
>>>> subn(pattern, repl, 'aaa')  # expecting 'YX'
> ('XXX', 3)
>
> The result that you get using this method will be consistent but
> arbitrary and unpredictable.
>

That's a simple fix

import re

mapping = {
        "foo" : "bar",
        "baz" : "quux",
        "quuux" : "foo"
}

keys = [(len(key), key) for key in mapping.keys()]
keys.sort(reverse=True)
keys = [key for (_, key) in keys]

pattern = "(%s)" % "|".join(keys)
repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"

re.subn(pattern, repl, s)

With regards to timing - I suspect that as the input string grows, the
ratio of the regexp solution to the string solution will grow. If the
map grows, the regexp solution seems faster (see below). Python's
regular expression state machine is very inefficient with lots of
disjunctions since it involves a lot of backtracking - this can be
hand optimised for a much greater speed improvement.

N = 1000
alphabet = "abcdefghijklmnopqrstuvwxyz"
make_token = lambda : "".join(sample(alphabet, 5))
mapping = dict([(make_token(), make_token()) for i in range(N)])

keys = [(len(key), key) for key in mapping.keys()]
keys.sort(reverse=True)
keys = [key for (_, key) in keys]

pattern = "(%s)" % "|".join(keys)
replace_strs = ["replace('%s', '%s')" % (key, mapping[key]) for key in keys]
command = "s." + ".".join(replace_strs)

setup = """
import re

regexp = re.compile("%s")
repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"
""" % pattern

t1 = Timer("regexp.subn(repl, s)", setup)
t2 = Timer(command, "s = 'fooxxxbazyyyquuux'")

res1 = sum(t1.repeat(number=1000))
res2 = sum(t2.repeat(number=1000))
print res1 / res2

on my machine this comes to
>>> 0.316323109737