RegEx conditional search and replace

Thu Jul 6 08:18:48 EDT 2006

On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:

> "Juho Schultz" <juho.schultz at pp.inet.fi> wrote in message 
> news:1152125582.577222.95010 at l70g2000cwa.googlegroups.com...
>> Martin Evans wrote:
>>> Sorry, yet another REGEX question.  I've been struggling with trying to 
>>> get
>>> a regular expression to do the following example in Python:
>>>
>>> Search and replace all instances of "sleeping" with "dead".
>>>
>>> This parrot is sleeping. Really, it is sleeping.
>>> to
>>> This parrot is dead. Really, it is dead.
>>>
>>>
>>> But not if part of a link or inside a link:
>>>
>>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, 
>>> it
>>> is sleeping.
>>> to
>>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really, 
>>> it
>>> is dead.
>>>
>>>
>>> This is the full extent of the "html" that would be seen in the text, the
>>> rest of the page has already been processed. Luckily I can rely on the
>>> formating always being consistent with the above example (the url will
>>> normally by much longer in reality though). There may though be more than
>>> one link present.
>>>
>>> I'm hoping to use this to implement the automatic addition of links to 
>>> other
>>> areas of a website based on keywords found in the text.
>>>
>>> I'm guessing this is a bit too much to ask for regex. If this is the 
>>> case,
>>> I'll add some more manual Python parsing to the string, but was hoping to
>>> use it to learn more about regex.
>>>
>>> Any pointers would be appreciated.
>>>
>>> Martin
>>
>> What you want is:
>>
>> re.sub(regex, replacement, instring)
>> The replacement can be a function. So use a function.
>>
>> def sleeping_to_dead(inmatch):
>>  instr = inmatch.group(0)
>>  if needsfixing(instr):
>>    return instr.replace('sleeping','dead')
>>  else:
>>    return instr
>>
>> as for the regex, something like
>> (<a)?[^<>]*(</a>)?
>> could be a start. It is probaly better to use the regex to recognize
>> the links as you might have something like
>> sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
>> probably want to do many fixes, so you can put them all within the same
>> replacer function.
> 
> ... My first 
> working attempt had been to use the regex to locate all links. I then looped 
> through replacing each with a numbered dummy entry. Then safely do the 
> find/replaces and then replace the dummy entries with the original links. It 
> just seems overly inefficient.

Someone may have made use of 
 multiline links:

_________________________
This parrot 
<a 
	href="sleeping.htm" 
	target="new"   >
is sleeping
</a>. 
Really, it is sleeping.
_________________________

In such a case you may need to make the page 
into one string to search if you don't want to use some complex
method of tracking state with variables as you move from
string to string.  You'll also have to make it possible
for non-printing characters to have been inserted in all sorts
of ways around the '>' and '<' and 'a' or 'A' 
characters using ' *' here and there in he regex.