Regex - where do I make a mistake?

Carsten Haese carsten at uniqsys.com
Fri Feb 16 09:20:00 EST 2007


On Fri, 2007-02-16 at 05:34 -0800, Johny wrote:
> On Feb 16, 2:14 pm, Peter Otten <__pete... at web.de> wrote:
> > Johny wrote:
> > > I have
> > > string="""<span class="test456">55</span>.
> > > <td><span class="test123">128</span>
> > > <span class="test789">170</span>
> > > """
> >
> > > where I need to replace
> > > <span class="test456">55</span>.
> > > <span class="test789">170</span>
> >
> > > by space.
> > > So I tried
> >
> > > #############
> > > import re
> > > string="""<td><span class="test456">55</span>.<span
> > > class="test123">128</span><span class="test789">170</span>
> > > """
> > > Newstring=re.sub(r'<span class="test(?!123)">.*</span>'," ",string)
> > > ###########
> >
> > > But it does NOT work.
> > > Can anyone explain why?
> >
> > "(?!123)" is a negative "lookahead assertion", i.e. it ensures that "test"
> > is not followed by "123", but /doesn't/ consume any characters. For your
> > regex to match "test" must be /immediately/ followed by a '"'.
> >
> > Regular expressions are too low-level to use on HTML directly. Go with
> > BeautifulSoup instead of trying to fix the above.
> >
> Yes, I know "(?!123)" is a negative "lookahead assertion",
> but do not know exactly why it does not work.

It *does* work, it just doesn't do what you think it does.

The lookahead assertion is a zero-width match: it consumes no characters
from the subject. You can picture it as matching an imaginary vertical
line between two consecutive characters of the subject.

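Nothing in your pattern matches the digits that follow "test". Once the
assertion succeeds, the next thing the pattern demands is a '"', but the
subject has "456" or "789" at that position, hence the subject fails to
match the pattern.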

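To make that concrete, here is a small sketch (not from the original
thread) that runs the pattern from the question against the sample and
then tries one possible repair. The names html, original and fixed are
mine, and the repaired pattern rests on assumptions: the class is always
"test" plus digits, each span sits on a single line, and spans are never
nested.

```python
import re

html = (
    '<span class="test456">55</span>.\n'
    '<td><span class="test123">128</span>\n'
    '<span class="test789">170</span>\n'
)

# The pattern from the question: the assertion after "test" consumes
# nothing, so the very next character in the subject must be '"'.
original = r'<span class="test(?!123)">.*</span>'
print(re.search(original, html))   # None

# One possible repair (an assumption, not the only fix): keep the
# assertion, then actually consume the digits with \d+. The '"' inside
# the lookahead excludes exactly "test123", and the non-greedy .*?
# stops at the first </span>.
fixed = r'<span class="test(?!123")\d+">.*?</span>'
result = re.sub(fixed, " ", html)
print(result)
```

Only the test123 span survives. The "." after the first span is left
alone because nothing in the repaired pattern matches it.
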
Also, please note Peter's advice that regular expressions are almost
always the wrong tool for working with HTML. They may work in very
limited cases, and maybe you have such a limited case, but you'd better
make sure that you'll never ever have to handle anything beyond this
limited case.
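
For comparison, here is a rough sketch of the parser-based approach.
Peter suggested BeautifulSoup; so that the example runs without
third-party packages, I have swapped in the standard library's
html.parser instead. The SpanFilter class and replace_other_spans
function are hypothetical names of mine, and the sketch assumes spans
are not nested and the text contains no character entities.

```python
from html.parser import HTMLParser


class SpanFilter(HTMLParser):
    """Re-emit the input, replacing unwanted <span> elements by a space."""

    def __init__(self, keep_class):
        super().__init__()
        self.keep_class = keep_class
        self.out = []
        self.skipping = False   # True while inside an unwanted span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") != self.keep_class:
            self.skipping = True
            self.out.append(" ")
        elif not self.skipping:
            # Copy the tag through exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skipping:
            if tag == "span":
                self.skipping = False
        else:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(data)


def replace_other_spans(html, keep_class):
    parser = SpanFilter(keep_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<span class="test456">55</span>.\n'
    '<td><span class="test123">128</span>\n'
    '<span class="test789">170</span>\n'
)
print(replace_other_spans(sample, "test123"))
```

The parser decides where tags begin and end, so attribute order, extra
whitespace or a line break inside a tag no longer matter the way they
do for a hand-written pattern.
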

-Carsten




