Using a function for regular expression substitution

naugiedoggie michael.a.powe at gmail.com
Mon Aug 30 08:52:02 EDT 2010


On Aug 29, 1:14 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> On 29/08/2010 15:22, naugiedoggie wrote:

> > I'm having a problem with using a function as the replacement in
> > re.sub().

> > Here is the function:

> > def normalize(s) :
> >      return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))
>
> This normalises the provider and returns only that, and none of the
> remainder of the string.
>
> I think you might want this:
>
> def normalize(s):
>      return (s[ : s.start('provider')] +
>              urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) +
>              s[s.start('provider') : ])
>
> It returns the part before the provider, followed by the normalised
> provider, and then the part after the provider.

Hello,

Thanks for the reply.

There must be something basic about the re.sub() function that I'm
missing.  The documentation shows this example:

<code>
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
</code>

According to the doc, the modifying function takes one parameter, the
MatchObject.  The re.sub function takes only a compiled regex object
or a pattern, generates a MatchObject from that object/pattern and
passes the MatchObject to the given function. Notice that in the
examples, the re.sub() returns the entire line, with the changes made.
But the function itself returns only the change.  What is happening
for me is that, if I have a line that contains
&Search_Provider=chen&p=value, the processed line ends up with
&Chen&p=value.
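
For what it's worth, here is a minimal sketch of why the parameter name disappears. It assumes a pattern that matches the whole `Search_Provider=value` pair with the value in a named group `provider`; the real provider_pattern is not shown in this thread, so the pattern and the two helper functions below are made up for illustration. re.sub() replaces the entire matched text (group 0) with whatever the function returns, not just the named group.

```python
import re

# Hypothetical pattern: the thread does not show the real provider_pattern.
provider_pattern = re.compile(r'Search_Provider=(?P<provider>[^&]*)')

def only_group(m):
    # Returns just the changed value, so the 'Search_Provider=' part
    # of the match is dropped from the result.
    return m.group('provider').capitalize()

def whole_match(m):
    # Returns replacement text for everything the pattern matched.
    return 'Search_Provider=' + m.group('provider').capitalize()

line = '&Search_Provider=chen&p=value'
lost = provider_pattern.sub(only_group, line)
kept = provider_pattern.sub(whole_match, line)
print(lost)
print(kept)
```

In the documentation's dashrepl example the pattern matches only the dashes, so returning "only the change" and returning "the whole match" happen to be the same thing.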

Now, I did follow up with your suggestion.  `s' is actually a
MatchObject (bad param naming on my part, I started out passing a
string into the function and then changed it to a MatchObject, but
didn't change the param name), so I made the following change:

<code>
return line[s.pos : s.start('provider')] + \
        urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) + \
        line[s.end('provider') : ]
</code>

In order to make this work (finally), I had to make the processing
function look like this:

<code>
def processLine(l) :
        global line   # normalize() reads this module-level name
        line = l
        provider = getProvider(line)
        if provider == "No Provider" : return line
        scenario = getScenario(line)
        if filter (lambda a: a != None,
                   [getOrg(s,scenario) for s in orgs]) == [] :
            line = re.sub(provider_pattern,normalize,line)
        else :
            # str.replace() returns a new string, so the result must be rebound
            line = line.replace(provider_parameter, org_parameter)
        return line
</code>

And then the call:

<code>
lines = fileReader.readlines()
for l in lines :
        fileWriter.write(processLine(l))
</code>

Without this complicated gobbledygook, I could not get the correct
result.  I hate global vars and I completely do not understand why I
have to go through this twisting and turning to get the desired
result.
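
One possible way around the global, sketched under assumptions: the pattern below is hypothetical (the real provider_pattern is not shown here), and the code is written for Python 3, where quote and unquote live in urllib.parse rather than directly in urllib. The idea is that the function rebuilds only the matched text, using group offsets shifted to be relative to the start of the match, so it never needs to see the whole line.

```python
import re
import string
from urllib.parse import quote, unquote  # Python 3 location of quote/unquote

# Hypothetical pattern: the thread does not show the real provider_pattern.
provider_pattern = re.compile(r'Search_Provider=(?P<provider>[^&]*)')

def normalize(m):
    whole = m.group(0)
    # start()/end() are offsets into the searched string;
    # subtract m.start() to get offsets into the matched text.
    start = m.start('provider') - m.start()
    end = m.end('provider') - m.start()
    fixed = quote(string.capwords(unquote(m.group('provider'))))
    return whole[:start] + fixed + whole[end:]

def process_line(line):
    # No global needed: normalize() gets everything from the match object.
    return provider_pattern.sub(normalize, line)

result = process_line('&Search_Provider=chen%20medical&p=value\n')
print(result)
```

If the function really does need the full input, the match object also carries it as `m.string`, which would avoid the global as well.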

[ ... ]

> These can be replaced by:
>
>         if 'Search_Type' in line and 'Search_Provider' in line:
>
> >            re.sub(provider_matcher,normalize,line)
>
> re.sub is returning the result, which you're throwing away!
>
>                 line = re.sub(provider_matcher,normalize,line)

I can't count the number of times I have forgotten the meaning of
'returns a string' when reading docs about doing substitutions. In
this case, I had put the `line = ' in and taken it out.  And I should
know better, from years of programming in Java, where strings are
immutable and you _always_ get a new, returned string.  Should be
second nature.
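
A small sketch of that point, with a made-up sample string: both re.sub() and str.replace() leave the original string alone and hand back a new one, so the result only sticks if it is assigned.

```python
import re

line = 'a-b-c'

re.sub('-', '+', line)       # result is discarded; line is untouched
line.replace('a', 'A')       # same here: a new string is returned and dropped
unchanged = line

line = re.sub('-', '+', line)    # rebind the name to keep the result
line = line.replace('a', 'A')
print(unchanged, line)
```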

Thanks for the help, much appreciated.

mp


