RegEx conditional search and replace

Wed Jul 5 14:53:03 EDT 2006

Martin Evans wrote:
> Sorry, yet another REGEX question.  I've been struggling with trying to get
> a regular expression to do the following example in Python:
>
> Search and replace all instances of "sleeping" with "dead".
>
> This parrot is sleeping. Really, it is sleeping.
> to
> This parrot is dead. Really, it is dead.
>
>
> But not if part of a link or inside a link:
>
> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it
> is sleeping.
> to
> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, it
> is dead.
>
>
> This is the full extent of the "html" that would be seen in the text, the
> rest of the page has already been processed. Luckily I can rely on the
> formating always being consistent with the above example (the url will
> normally by much longer in reality though). There may though be more than
> one link present.
>
> I'm hoping to use this to implement the automatic addition of links to other
> areas of a website based on keywords found in the text.
>
> I'm guessing this is a bit too much to ask for regex. If this is the case,
> I'll add some more manual Python parsing to the string, but was hoping to
> use it to learn more about regex.
>
> Any pointers would be appreciated.
>
> Martin

What you want is:

re.sub(regex, replacement, instring)
The replacement can be a function. So use a function.

def sleeping_to_dead(inmatch):
  instr = inmatch.group(0)
  if needsfixing(instr):
    return instr.replace('sleeping','dead')
  else:
    return instr

as for the regex, something like
(<a)?[^<>]*(</a>)?
could be a start. It is probaly better to use the regex to recognize
the links as you might have something like
sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
probably want to do many fixes, so you can put them all within the same
replacer function.