Regex - where do I make a mistake?

Peter Otten __peter__ at web.de
Fri Feb 16 08:50:04 EST 2007


Johny wrote:

> On Feb 16, 2:14 pm, Peter Otten <__pete... at web.de> wrote:
>> Johny wrote:
>> > I have
>> > string="""<span class="test456">55</span>.
>> > <td><span class="test123">128</span>
>> > <span class="test789">170</span>
>> > """
>>
>> > where I need to replace
>> > <span class="test456">55</span>.
>> > <span class="test789">170</span>
>>
>> > by space.
>> > So I tried
>>
>> > #############
>> > import re
>> > string="""<td><span class="test456">55</span>.<span
>> > class="test123">128</span><span class="test789">170</span>
>> > """
>> > Newstring=re.sub(r'<span class="test(?!123)">.*</span>'," ",string)
>> > ###########
>>
>> > But it does NOT work.
>> > Can anyone explain why?
>>
>> "(?!123)" is a negative "lookahead assertion", i. e. it ensures that
>> "test" is not followed by "123", but /doesn't/ consume any characters.
>> For your regex to match "test" must be /immediately/ followed by a '"'.
>>
>> Regular expressions are too lowlevel to use on HTML directly. Go with
>> BeautifulSoup instead of trying to fix the above.
>>
>> Peter
> 
> Yes, I know "(?!123)" is a negative "lookahead assertion",
> but I do not know exactly why it does not work. I thought that
> 
> (?!...)
> Matches if ... doesn't match next.  For example, Isaac (?!Asimov) will
> match 'Isaac ' only if it's not followed by 'Asimov'.

The problem is that your regex does not end with the lookahead assertion and
there is nothing to consume the '456' or '789'. To illustrate:

>>> import re
>>> for example in ["before123after", "before234after", "beforeafter"]:
...     re.findall(r"before(?!123)after", example)
...
[]
[]
['beforeafter']
>>> for example in ["before123after", "before234after", "beforeafter"]:
...     re.findall(r"before(?!123)\d\d\dafter", example)
...
[]
['before234after']
[]
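
Applied to the string from the question, the same idea might look like the
sketch below. It assumes the class suffix is always made of digits and that
the spans are not nested; the names html and pattern are only illustrative,
and the non-greedy .*? is my addition so that each match stops at the first
closing tag. Note that (?!123) also rejects longer suffixes that merely start
with 123, such as "test1234".

```python
import re

# Sample input taken from the original question.
html = (
    '<td><span class="test456">55</span>.'
    '<span class="test123">128</span>'
    '<span class="test789">170</span>\n'
)

# (?!123) rejects "test123"; \d+ then consumes the digits that are allowed.
# .*? is non-greedy, so each match ends at the first closing </span>.
pattern = re.compile(r'<span class="test(?!123)\d+">.*?</span>')

result = pattern.sub(" ", html)
print(repr(result))
```

This is still fragile on real HTML (attribute order, extra whitespace, nested
tags), which is why a parser such as BeautifulSoup remains the safer route.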

Peter


