RegEx conditional search and replace

Thu Jul 6 03:32:46 EDT 2006

"Juho Schultz" <juho.schultz at pp.inet.fi> wrote in message 
news:1152125582.577222.95010 at l70g2000cwa.googlegroups.com...
> Martin Evans wrote:
>> Sorry, yet another REGEX question.  I've been struggling with trying to 
>> get
>> a regular expression to do the following example in Python:
>>
>> Search and replace all instances of "sleeping" with "dead".
>>
>> This parrot is sleeping. Really, it is sleeping.
>> to
>> This parrot is dead. Really, it is dead.
>>
>>
>> But not if part of a link or inside a link:
>>
>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, 
>> it
>> is sleeping.
>> to
>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, 
>> it
>> is dead.
>>
>>
>> This is the full extent of the "html" that would be seen in the text, the
>> rest of the page has already been processed. Luckily I can rely on the
>> formating always being consistent with the above example (the url will
>> normally by much longer in reality though). There may though be more than
>> one link present.
>>
>> I'm hoping to use this to implement the automatic addition of links to 
>> other
>> areas of a website based on keywords found in the text.
>>
>> I'm guessing this is a bit too much to ask for regex. If this is the 
>> case,
>> I'll add some more manual Python parsing to the string, but was hoping to
>> use it to learn more about regex.
>>
>> Any pointers would be appreciated.
>>
>> Martin
>
> What you want is:
>
> re.sub(regex, replacement, instring)
> The replacement can be a function. So use a function.
>
> def sleeping_to_dead(inmatch):
>  instr = inmatch.group(0)
>  if needsfixing(instr):
>    return instr.replace('sleeping','dead')
>  else:
>    return instr
>
> as for the regex, something like
> (<a)?[^<>]*(</a>)?
> could be a start. It is probaly better to use the regex to recognize
> the links as you might have something like
> sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
> probably want to do many fixes, so you can put them all within the same
> replacer function.

Many thanks for that, the function method looks very useful. My first 
working attempt had been to use the regex to locate all links. I then looped 
through replacing each with a numbered dummy entry. Then safely do the 
find/replaces and then replace the dummy entries with the original links. It 
just seems overly inefficient.