substitution

Mon Jan 18 11:46:45 EST 2010

On Jan 18, 4:26 pm, Steven D'Aprano <st... at REMOVE-THIS-
cybersource.com.au> wrote:
> On Mon, 18 Jan 2010 06:23:44 -0800, Iain King wrote:
> > On Jan 18, 2:17 pm, Adi Eyal <a... at digitaltrowel.com> wrote:
> [...]
> >> Using regular expressions the answer is short (and sweet)
>
> >> mapping = {
> >>         "foo" : "bar",
> >>         "baz" : "quux",
> >>         "quuux" : "foo"
>
> >> }
>
> >> pattern = "(%s)" % "|".join(mapping.keys())
> >> repl = lambda x : mapping.get(x.group(1), x.group(1))
> >> s = "fooxxxbazyyyquuux"
> >> re.subn(pattern, repl, s)
>
> > Winner! :)
>
> What are the rules for being declared "Winner"? For the simple case
> given, calling s.replace three times is much faster: more than twice as
> fast.
>
> But a bigger problem is that the above "winner" may not work correctly if
> there are conflicts between the target strings (e.g. 'a'->'X',
> 'aa'->'Y'). The problem is that the result you get depends on the order
> of the searches, BUT as given, that order is non-deterministic.
> dict.keys() returns in an arbitrary order, which means the caller can't
> specify the order except by accident. For example:
>
> >>> repl = lambda x : m[x.group(1)]
> >>> m = {'aa': 'Y', 'a': 'X'}
> >>> pattern = "(%s)" % "|".join(m.keys())
> >>> subn(pattern, repl, 'aaa')  # expecting 'YX'
>
> ('XXX', 3)
>
> The result that you get using this method will be consistent but
> arbitrary and unpredictable.
>
> For those who care, here's my timing code:
>
> from timeit import Timer
>
> setup = """
> mapping = {"foo" : "bar", "baz" : "quux", "quuux" : "foo"}
> pattern = "(%s)" % "|".join(mapping.keys())
> repl = lambda x : mapping.get(x.group(1), x.group(1))
> repl = lambda x : mapping[x.group(1)]
> s = "fooxxxbazyyyquuux"
> from re import subn
> """
>
> t1 = Timer("subn(pattern, repl, s)", setup)
> t2 = Timer(
> "s.replace('foo', 'bar').replace('baz', 'quux').replace('quuux', 'foo')",
> "s = 'fooxxxbazyyyquuux'")
>
> And the results on my PC:
>
> >>> min(t1.repeat(number=100000))
> 1.1273870468139648
> >>> min(t2.repeat(number=100000))
>
> 0.49491715431213379
>
> --
> Steven

Adi elicited that response from me because his solution was vastly
more succinct than everything else that had appeared up til that point
while still meeting the OP's requirements.  The OP never cared about
overlap between 2 'find' strings, just between the 'find' and
'replace' strings (though I did take it into account in my second post
for the sake of completeness).  His code could have been a little
cleaner, I'd have trimmed it to:

mapping = {"foo": "bar", "baz": "quux", "quuux": "foo"}
pattern = "(%s)" % "|".join(mapping)
repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"
re.subn(pattern, repl, s)

but apart from that was very pythonic: explicit, succinct, and all the
heavy work is done by the module (i.e. in compiled c code in the
majority case of CPython).  It can be 'upgraded' to cover the find-
find overlap if you really want (I *believe* regexps will match the
leftmost option in a group first):

subs = [("foo", "bar"), ("baz", "quux"), ("quuux", "foo")]
pattern = "(%s)" % "|".join((x[0] for x in subs))
mapping = dict(subs)
repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"
re.subn(pattern, repl, s)

Anyway, there's no prize for winning, but by all means: if you think
someone else's code and not a variation on this should win for most
pythonic, then make your nomination :)

Iain