Help, Search and Substitute

Alex Martelli aleaxit at yahoo.com
Fri Nov 3 04:39:54 EST 2000


<andy47 at my-deja.com> wrote in message news:8ttai1$c53$1 at nnrp1.deja.com...
    [snip]
> > def symbol_replace(match, get=symbol_map.get):
> >     return get(match.group(1), "")
    [snip]
> > print symbol_pattern.sub(symbol_replace, "{foo}fie{bar}")
    [snip]
> I'm probably being really dumb here. I can see that the above works,
> and I've got a vague idea of how it works but the detail is escaping
> me. Can anyone explain how it works in words of one syllable to us
> newbies? In particular, I'm a little lost as to how the function
> symbol_replace works.

Well, these few lines do use several somewhat-advanced Python
techniques, so let's go over them in detail; apologies in
advance for repetitiousness in the following, but I'm trying
to be as thorough and complete as feasible... somebody with
some gift for concision can no doubt summarize the following
in 12 or fewer words of 1 or fewer syllables each!-)

Starting "from the end"...:

The .sub method of a compiled RE (here, the compiled RE is what
symbol_pattern is bound to) takes two arguments, the second one
being the string on which substitutions are to be performed,
and builds and returns a new string-with-substitutions -- OK
so far?  (It could be called with more args to limit the number
of substitutions performed, but this is not being used here: as
called, the method will perform substitution on every
non-overlapping match it finds for the RE in the string).

The first argument that is passed to the .sub method of a
compiled RE object can be either a string, or a function.

If it's a string, that's what gets substituted for each
(non-overlapping) match of the RE, and that's that.  But, here, we
are using a function as the first argument, so the 'richer' case
applies; that function will be called for each non-overlapping
match, and the string that the function returns in each case is
what will be substituted for that specific match.  OK so far?
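
To make the string-versus-function distinction concrete, here is a
minimal sketch.  The pattern, the text and the helper name 'double'
are made up purely for illustration, and it is written for current
Python, where print is a function rather than a statement:

```python
import re

# Hypothetical pattern and text, chosen only for illustration.
pattern = re.compile(r"\d+")
text = "a1b22c333"

# String replacement: the same text is substituted for every match.
print(pattern.sub("#", text))        # a#b#c#

# Function replacement: called once per match, and whatever string
# it returns is substituted for that particular match.
def double(match):
    return str(int(match.group(0)) * 2)

print(pattern.sub(double, text))     # a2b44c666

# The optional extra argument limits the number of substitutions.
print(pattern.sub("#", text, 2))     # a#b#c333
```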

The 'function' that is passed as the first argument of .sub is
actually any "callable" object (object which can be called!)
which is able to be called with exactly one argument.  That
argument, for each non-overlapping match, will be the match
object for that match, an instance of class re.MatchObject.
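
As a small illustration that any one-argument callable will do, not
just a plain function, here is a sketch using an instance of a class
that defines __call__.  The class name 'Upper' and its 'seen'
attribute are hypothetical, invented for this example only:

```python
import re

class Upper:
    """Hypothetical callable object: instances are called like functions."""

    def __init__(self):
        self.seen = []              # match objects received so far

    def __call__(self, match):
        # .sub hands us exactly one argument: the match object.
        self.seen.append(match)
        return match.group(0).upper()

replacer = Upper()
print(re.compile(r"[a-z]+").sub(replacer, "ab 12 cd"))   # AB 12 CD
print(len(replacer.seen))                                # 2
```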

The function called symbol_replace, that we have defined here,
does accept being called with a single argument, which it calls
'match', and that is what gets bound on each call to the
appropriate re.MatchObject instance.  The second argument to
symbol_replace will always be at its default value -- the
bound-method symbol_map.get -- as the .sub method of the compiled
RE always calls the function (that it gets as its first argument)
with exactly one argument (the match-object, i.e., the
appropriate instance of re.MatchObject).

It's a common Python idiom to let a function have "arguments
with default values" which are in fact not used _as arguments_,
but only to "inject" some desired object in the function's
local namespace.  It's Python's way to explicitly build a
little "closure" -- a callable-object that carries around a
few other objects it needs, bound to appropriate names in its
local namespace -- in absence of ``real'' (aka heavyweight)
closures with full lexical nesting &c.
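
A minimal sketch of that idiom, with made-up names ('table',
'lookup'): the default value is evaluated once, when the def
statement runs, so the function keeps using that object even if the
global name is rebound later.

```python
# Hypothetical names, for illustration only.
table = {"a": 1}

def lookup(key, get=table.get):
    # 'get' was bound once, when the def statement was executed;
    # callers are not expected to pass it.
    return get(key, 0)

print(lookup("a"))      # 1

# Rebinding the global name later does not affect the default:
table = {"a": 99}
print(lookup("a"))      # 1
```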

A bound-method is another Python way to build a special case
of a 'closure' -- here, what gets bound (and is carried
around as a part of the callable object) is the specific
object on which a certain method is to be called (this
construct is also called a "delegate" [?] in some other
languages, such as C#; I find the name "bound-method" to
be much clearer, personally!).
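
For the record, here is roughly what a bound-method looks like when
taken on its own; the dictionary contents are an assumption made up
for the example:

```python
symbol_map = {"foo": "FOO"}     # hypothetical contents

# Fetching the attribute without calling it yields a bound method:
# a callable that remembers which dictionary it belongs to.
get = symbol_map.get

print(get("foo", ""))               # FOO
print(repr(get("nope", "")))        # ''
print(get.__self__ is symbol_map)   # True
```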

So, we're getting close to the finish line.  For each
non-overlapping match of the compiled RE with the target
string (the second argument to the method .sub), the
function symbol_replace will be called with a match
object as its first argument 'match', and its second
argument, 'get', bound to its default value, ready to
call method .get on object symbol_map, a dictionary.

The .group method of a match-object returns the string
corresponding to the match of some group within the RE
that generated the match-object.  .group(0) returns
the string that matched the whole RE; .group(1) returns
the string that matched the first parenthesized group
in the RE.  Here, the specific RE was of the form:
    \{(foo|bar|baz)\}
as dynamically constructed from the .keys() of the
symbol_map dictionary.  The whole RE will match an
open-brace, followed by any one of the 'keywords' we
are using, followed by a closed-brace.  The parenthesized
group (there is only one), in particular, will match
the 'keyword' only, excluding the braces.

So, the call of .group(1) on the match-object will
return the string that is the keyword that was matched.
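
Here is one plausible way to build such an RE and inspect its
groups.  The construction code was snipped from the quoted post, so
the join/escape approach and the dictionary values below are my
assumptions, not necessarily what the original did:

```python
import re

symbol_map = {"foo": "1", "bar": "2", "baz": "3"}   # hypothetical values

# Join the keys with '|' inside one parenthesized group, wrapped in
# escaped braces; re.escape guards against keys that happen to
# contain RE metacharacters.
alternatives = "|".join(re.escape(key) for key in symbol_map)
symbol_pattern = re.compile(r"\{(" + alternatives + r")\}")

match = symbol_pattern.search("x{bar}y")
print(match.group(0))   # {bar}
print(match.group(1))   # bar
```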

That keyword will be passed as the first argument
to 'get' -- i.e., bound-method symbol_map.get.  The
.get() method of a dictionary takes two arguments --
the key to get, and the default value to return if
the key is not present (optional, defaults to None).
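
A quick sketch of the three cases, using a made-up dictionary:

```python
d = {"foo": "FOO"}              # hypothetical contents

print(d.get("foo"))             # FOO    (key present)
print(d.get("bar"))             # None   (key absent, no default given)
print(d.get("bar", "?????"))    # ?????  (key absent, explicit default)
```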

Here, the second argument is redundant, since we know
by construction that the first argument will always
be a keyword, so the default-value will never play
a role.  In fact, we could omit the second argument,
or better yet use some redundancy to diagnose any
possible error in our construction -- e.g., emit some
strikingly recognizable string as the default value
(I chose '?????' in my roughly equivalent but less
elegant solution), or maybe even better have an
exception if the key is not found as that indicates
some internal error in our program's logic -- I do
not think we can do the latter with the bound-method
idea, but we could pass the dictionary object instead:

def symbol_replace(match, dict=symbol_map):
    return dict[match.group(1)]

this will ensure a KeyError exception if we have
made some programming-logic mistake; it may be a
tad slower than <F>'s solution, though, since the
lookup of the dictionary method corresponding to
our indexing operation happens at each call, in
this approach, while with the bound-method idea
the lookup price is paid but once (one would have
to profile to know for sure if that matters!).
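
If one did want to measure, a rough sketch with the standard timeit
module might look like the following.  The names 'via_get' and
'via_index', the dictionary contents and the test string are all
assumptions for illustration (I also call the parameter 'mapping'
here so as not to shadow the built-in dict); the timings depend on
machine and Python version, so none are shown:

```python
import re
import timeit

symbol_map = {"foo": "FOO", "bar": "BAR"}       # hypothetical contents
symbol_pattern = re.compile(r"\{(foo|bar)\}")

def via_get(match, get=symbol_map.get):
    # Bound method looked up once, at definition time.
    return get(match.group(1), "")

def via_index(match, mapping=symbol_map):
    # Indexing; raises KeyError on a logic mistake.
    return mapping[match.group(1)]

text = "{foo}fie{bar}" * 100

for func in (via_get, via_index):
    seconds = timeit.timeit(
        lambda: symbol_pattern.sub(func, text), number=200)
    print(func.__name__, seconds)
```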


So, anyway, here we are.  symbol_replace returns the
string that should be substituted for the substring of
the target that the regular expression matched in each
specific non-overlapping match; the .sub method collects
these replacement strings, together with the pieces of
the target that lie outside the matches, and in the end
puts Humpty Dumpty back together again properly, with
all of the substitutions.

Watching it all in action is nice, and without even
bothering a debugger we can easily do it with print:

def symbol_replace(match, dict=symbol_map):
    result = dict[match.group(1)]
    print 'replacing "%s" with "%s"' % (
        match.group(0), result)
    return result
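
For anyone who wants to paste something and run it as-is, here is a
self-contained sketch of the whole thing for current Python (print
as a function).  The dictionary contents and the way the pattern is
built are my assumptions, since those lines were snipped above, and
the parameter is named 'mapping' rather than 'dict':

```python
import re

symbol_map = {"foo": "FOO", "bar": "BAR"}   # hypothetical contents

# Build \{(foo|bar)\} from the dictionary's keys.
symbol_pattern = re.compile(
    r"\{(" + "|".join(map(re.escape, symbol_map)) + r")\}")

def symbol_replace(match, mapping=symbol_map):
    result = mapping[match.group(1)]    # KeyError on a logic mistake
    print('replacing "%s" with "%s"' % (match.group(0), result))
    return result

print(symbol_pattern.sub(symbol_replace, "{foo}fie{bar}"))
# replacing "{foo}" with "FOO"
# replacing "{bar}" with "BAR"
# FOOfieBAR
```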


Alex





