RegEx conditional search and replace
Martin Evans
martin at browns-nospam.co.uk
Thu Jul 6 03:32:46 EDT 2006
"Juho Schultz" <juho.schultz at pp.inet.fi> wrote in message
news:1152125582.577222.95010 at l70g2000cwa.googlegroups.com...
> Martin Evans wrote:
>> Sorry, yet another REGEX question. I've been struggling with trying to
>> get
>> a regular expression to do the following example in Python:
>>
>> Search and replace all instances of "sleeping" with "dead".
>>
>> This parrot is sleeping. Really, it is sleeping.
>> to
>> This parrot is dead. Really, it is dead.
>>
>>
>> But not if part of a link or inside a link:
>>
>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
>> it
>> is sleeping.
>> to
>> This parrot <a href="sleeping.htm" target="new">is sleeping</a>. Really,
>> it
>> is dead.
>>
>>
>> This is the full extent of the "html" that would be seen in the text, the
>> rest of the page has already been processed. Luckily I can rely on the
>> formating always being consistent with the above example (the url will
>> normally by much longer in reality though). There may though be more than
>> one link present.
>>
>> I'm hoping to use this to implement the automatic addition of links to
>> other
>> areas of a website based on keywords found in the text.
>>
>> I'm guessing this is a bit too much to ask for regex. If this is the
>> case,
>> I'll add some more manual Python parsing to the string, but was hoping to
>> use it to learn more about regex.
>>
>> Any pointers would be appreciated.
>>
>> Martin
>
> What you want is:
>
> re.sub(regex, replacement, instring)
> The replacement can be a function. So use a function.
>
> def sleeping_to_dead(inmatch):
> instr = inmatch.group(0)
> if needsfixing(instr):
> return instr.replace('sleeping','dead')
> else:
> return instr
>
> as for the regex, something like
> (<a)?[^<>]*(</a>)?
> could be a start. It is probaly better to use the regex to recognize
> the links as you might have something like
> sleeping.sleeping/sleeping/sleeping.html in your urls. Also you
> probably want to do many fixes, so you can put them all within the same
> replacer function.
Many thanks for that, the function method looks very useful. My first
working attempt had been to use the regex to locate all links. I then looped
through replacing each with a numbered dummy entry. Then safely do the
find/replaces and then replace the dummy entries with the original links. It
just seems overly inefficient.
More information about the Python-list
mailing list