Using a function for regular expression substitution

Mon Aug 30 09:31:33 EDT 2010

On Aug 30, 8:52 am, naugiedoggie <michael.a.p... at gmail.com> wrote:
> On Aug 29, 1:14 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>
> > On 29/08/2010 15:22, naugiedoggie wrote:
> > > I'm having a problem with using a function as the replacement in
> > > re.sub().
> > > Here is the function:
> > > def normalize(s) :
> > >      return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))
>
> > This normalises the provider and returns only that, and none of the
> > remainder of the string.
>
> > I think you might want this:
>
> > def normalize(s):
> >      return (s[ : s.start('provider')] +
> >              urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) +
> >              s[s.start('provider') : ])
>
> > It returns the part before the provider, followed by the normalised
> > provider, and then the part after the provider.
>
> Hello,
>
> Thanks for the reply.
>
> There must be something basic about the re.sub() function that I'm
> missing.  The documentation shows this example:
>
> <code>
> >>> def dashrepl(matchobj):
> ...     if matchobj.group(0) == '-': return ' '
> ...     else: return '-'
> >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
> 'pro--gram files'
> >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
> 'Baked Beans & Spam'
> </code>
>
> According to the doc, the modifying function takes one parameter, the
> MatchObject.  The re.sub function takes only a compiled regex object
> or a pattern, generates a MatchObject from that object/pattern and
> passes the MatchObject to the given function. Notice that in the
> examples, the re.sub() returns the entire line, with the changes made.
> But the function itself returns only the change.  What is happening
> for me is that, if I have a line that contains
> &Search_Provider=chen&p=value, the processed line ends up with
> &Chen&p=value.
>
> Now, I did follow up with your suggestion.  `s' is actually a
> MatchObject (bad param naming on my part, I started out passing a
> string into the function and then changed it to a MatchObject, but
> didn't change the param name), so I made the following change:
>
> <code>
> return line[s.pos : s.start('provider')] + \
>         urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) + \
>         line[s.end('provider') : ]
> </code>
>
> In order to make this work (finally), I had to make the processing
> function look like this:
>
> <code>
> def processLine(l) :
>         global line
>         line = l
>         provider = getProvider(line)
>         if provider == "No Provider" : return line
>         scenario = getScenario(line)
>         if filter (lambda a: a != None, [getOrg(s,scenario) for s in orgs]) == [] :
>             line = re.sub(provider_pattern,normalize,line)
>         else :
>             # str.replace returns a new string, so the result must be assigned
>             line = line.replace(provider_parameter, org_parameter)
>         return line
> </code>
>
> And then the call:
>
> <code>
> lines = fileReader.readlines()
> [ fileWriter.write(l) for l in [processLine(l) for l in lines]]
> </code>
>
> Without this complicated gobbledygook, I could not get the correct
> result.  I hate global vars and I completely do not understand why I
> have to go through this twisting and turning to get the desired
> result.
>
> [ ... ]
>
> > These can be replaced by:
>
> >         if 'Search_Type' in line and 'Search_Provider' in line:
>
> > >            re.sub(provider_matcher,normalize,line)
>
> > re.sub is returning the result, which you're throwing away!
>
> >                 line = re.sub(provider_matcher,normalize,line)
>
> I can't count the number of times I have forgotten the meaning of
> 'returns a string' when reading docs about doing substitutions. In
> this case, I had put the `line = ' in and taken it out.  And I should
> know better, from years of programming in Java, where strings are
> immutable and you _always_ get a new, returned string.  Should be
> second nature.
>
> Thanks for the help, much appreciated.
>
> mp
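
The point about returned strings in the quoted exchange can be illustrated with a short sketch. It is written for Python 3 and uses a made-up sample string (`'a-b-c'`); the names `unchanged` and `line` are only for illustration. Both `re.sub()` and `str.replace()` leave their input untouched and return a new string.

```python
import re

line = 'a-b-c'

# Both calls build a new string; because the results are discarded,
# `line` still holds the original text afterwards.
re.sub('-', '+', line)
line.replace('-', '+')
unchanged = line

# Assigning the return value is what actually updates the variable.
line = re.sub('-', '+', line)
```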

Hello,

Well, that turned out to be still wrong.  I did start getting the
proper param=value back from my `normalize' function, but I got
"extra" data as well.

This works:

<code>
def normalize(s) :
    return (s.group('search') + '=' +
            urllib.quote(string.capwords(urllib.unquote(s.group('provider')))))
</code>

Essentially, the pattern contained two groups, one identifying the
parameter name and one the value.  By concat'ing the two back
together, I was able to achieve the desired result.
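
For reference, here is a minimal, self-contained sketch of that approach. It makes two assumptions: it targets Python 3, where `quote` and `unquote` live in `urllib.parse` rather than in `urllib` as in the Python 2 code above, and the pattern is a hypothetical stand-in, since the real `provider_pattern` is not shown in this thread.

```python
import re
import string
from urllib.parse import quote, unquote

# Hypothetical pattern (assumption): one group for the parameter name,
# one for its value, which runs up to the next '&'.
provider_pattern = re.compile(r'(?P<search>Search_Provider)=(?P<provider>[^&]*)')

def normalize(match):
    # Decode, capitalise each word, then re-encode the value.
    value = quote(string.capwords(unquote(match.group('provider'))))
    # Return the whole replacement text: name, '=', and normalised value.
    return match.group('search') + '=' + value

result = provider_pattern.sub(normalize, '&Search_Provider=chen&p=value')
print(result)  # &Search_Provider=Chen&p=value
```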

I suppose the lesson is that whatever the function returns replaces the
entire match, not just the text captured by one of its groups.
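
A small Python 3 sketch of that behaviour follows. The pattern and the helper names (`only_group`, `value_only`) are assumptions made for illustration. The first call reproduces the `&Chen&p=value` result described earlier; the second shows one possible alternative, not discussed in the thread, where a lookbehind keeps the parameter name out of the match so the function can return just the value.

```python
import re

line = '&Search_Provider=chen&p=value'

# Hypothetical pattern (assumption): the match covers name, '=', and value.
whole = re.compile(r'(?P<search>Search_Provider)=(?P<provider>[^&]*)')

def only_group(match):
    # Returns just the value, so the parameter name and '=' are lost.
    return match.group('provider').capitalize()

lost_name = whole.sub(only_group, line)

# Alternative: the lookbehind requires 'Search_Provider=' before the value
# but does not include it in the match, so only the value is replaced.
value_only = re.compile(r'(?<=Search_Provider=)[^&]*')
kept_name = value_only.sub(lambda m: m.group(0).capitalize(), line)

print(lost_name)  # &Chen&p=value
print(kept_name)  # &Search_Provider=Chen&p=value
```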

Thanks.

mp