RegEx conditional search and replace

Martin Evans martin at browns-nospam.co.uk
Thu Jul 6 10:23:34 EDT 2006


"mbstevens" <NOXwebmasterX at XmbstevensX.com> wrote in message 
news:pan.2006.07.06.12.19.42.943014 at XmbstevensX.com...
> On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:
>
>> "Juho Schultz" <juho.schultz at pp.inet.fi> wrote in message
>> news:1152125582.577222.95010 at l70g2000cwa.googlegroups.com...
>>> Martin Evans wrote:
>>>> Sorry, yet another REGEX question.  I've been struggling with trying to
>>>> get
>>>> a regular expression to do the following example in Python:
>>>>
>>>> Search and replace all instances of "sleeping" with "dead".
>>>>
>>>> This parrot is sleeping. Really, it is sleeping.
>>>> to
>>>> This parrot is dead. Really, it is dead.
>>>>
>>>>
>>>> But not if part of a link or inside a link:
>>>>
>>>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. 
>>>> Really,
>>>> it
>>>> is sleeping.
>>>> to
>>>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. 
>>>> Really,
>>>> it
>>>> is dead.
>>>>
>>>>
>>>> This is the full extent of the "html" that would be seen in the text, 
>>>> the
>>>> rest of the page has already been processed. Luckily I can rely on the
>>>> formating always being consistent with the above example (the url will
>>>> normally by much longer in reality though). There may though be more 
>>>> than
>>>> one link present.
>>>>
>>>> I'm hoping to use this to implement the automatic addition of links to
>>>> other
>>>> areas of a website based on keywords found in the text.
>>>>
>>>> I'm guessing this is a bit too much to ask for regex. If this is the
>>>> case,
>>>> I'll add some more manual Python parsing to the string, but was hoping 
>>>> to
>>>> use it to learn more about regex.
>>>>
>>>> Any pointers would be appreciated.
>>>>
>>>> Martin
>>>
>>> What you want is:
>>>
>>> re.sub(regex, replacement, instring)
>>> The replacement can be a function. So use a function.
>>>
>>> def sleeping_to_dead(inmatch):
>>>  instr = inmatch.group(0)
>>>  if needsfixing(instr):
>>>    return instr.replace('sleeping','dead')
>>>  else:
>>>    return instr
>>>
>>> as for the regex, something like
>>> (<a)?[^<>]*(</a>)?
>>> could be a start. It is probaly better to use the regex to recognize
>>> the links as you might have something like
>>> sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
>>> probably want to do many fixes, so you can put them all within the same
>>> replacer function.
>>
>> ... My first
>> working attempt had been to use the regex to locate all links. I then 
>> looped
>> through replacing each with a numbered dummy entry. Then safely do the
>> find/replaces and then replace the dummy entries with the original links. 
>> It
>> just seems overly inefficient.
>
> Someone may have made use of
> multiline links:
>
> _________________________
> This parrot
> <a
> href="sleeping.htm"
> target="new"   >
> is sleeping
> </a>.
> Really, it is sleeping.
> _________________________
>
>
> In such a case you may need to make the page
> into one string to search if you don't want to use some complex
> method of tracking state with variables as you move from
> string to string.  You'll also have to make it possible
> for non-printing characters to have been inserted in all sorts
> of ways around the '>' and '<' and 'a' or 'A'
> characters using ' *' here and there in he regex.

I agree, but luckily in this case the HREF will always be formated the same 
as it happens to be inserted by another automated system, not a user. Ok it 
be changed by the server but AFAIK it hasn't for the last 6 years. The text 
in question apart from the links should be plain (not even <b> is allowed I 
think).

I'd read about back and forward regex matching and thought it might somehow 
be of use here ie back search for the "<A" and forward search for the "</A>" 
but of course this would easily match in between two links.






More information about the Python-list mailing list