[Tutor] Regex [negative lookbehind / use HTMLParser to parse HTML]

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sun Aug 24 22:00:56 EDT 2003



On Sun, 24 Aug 2003, Andrei wrote:

> I'm quite sure I've seen a question of this type before, but I seem
> unable to find it. How can I match a re pattern ONLY if it is not
> preceded by another re pattern?

Hi Andrei,

We can use the negative lookbehind "(?<! )" syntax:

    http://www.python.org/doc/lib/re-syntax.html

For example:

###
>>> import re
>>> regex = re.compile(r'(?<!foo)bar')
>>> regex.search('the chocolate bar melts')
<_sre.SRE_Match object at 0x1e8900>
>>> regex.search('foobar')
>>>
###
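
One caveat about the "another re pattern" part of the question: in the re module, whatever goes inside a lookbehind has to match strings of a fixed length. Here is a small sketch of the same example as a standalone script (written for Python 3; the variable names are just mine for illustration), which also shows a variable-width lookbehind being rejected:

```python
import re

# Match "bar" only when it is not immediately preceded by "foo".
regex = re.compile(r'(?<!foo)bar')

plain = regex.search('the chocolate bar melts')  # a match object
glued = regex.search('foobar')                   # None: "bar" follows "foo"
print(plain is not None, glued is None)

# The lookbehind body must have a fixed width, so "fo+" is not allowed.
try:
    re.compile(r'(?<!fo+)bar')
    rejected = False
except re.error as exc:
    rejected = True
    print("rejected:", exc)
```

So a lookbehind works for a literal prefix like 'foo', but not for an arbitrary preceding pattern.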



> Think for example of finding all URLs in a piece of text, but *not* if
> they are inside link tags and therefore preceded by 'href="'. <a
> href="http://python.org">Python</a> shouldn't give a match, but
> http://python.org on its own should.

The obligatory "don't parse HTML with regular expressions alone" retort
is on the tip of my tongue.  *grin*
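
To make that retort a bit more concrete, here is a rough sketch of how a lookbehind-only approach can go wrong. The pattern and the two sample strings are made up for illustration; the only assumption is that the same link might be written with either kind of quote:

```python
import re

# Hypothetical pattern: the URL, unless it is directly preceded by href="
url = re.compile(r'(?<!href=")http://python\.org')

double_quoted = '<a href="http://python.org">Python</a>'
single_quoted = "<a href='http://python.org'>Python</a>"

skipped = url.findall(double_quoted)  # attribute is skipped, as intended
leaked = url.findall(single_quoted)   # single quotes slip past the lookbehind
print(skipped)
print(leaked)
```

The second link is perfectly ordinary HTML, but the lookbehind only knows about one exact spelling of the attribute, so the URL inside the tag gets matched anyway.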


For this particular example, it's a better idea to use regular expressions
in concert with something like HTMLParser:

    http://www.python.org/doc/lib/module-HTMLParser.html

For example:

###
>>> regex = re.compile(r"(http://python\.org)")
>>> text = """
... The python.org web site,
... <a href="http://python.org">http://python.org</a>
... is a great resource"""
>>>
>>> regex.findall(text)
['http://python.org', 'http://python.org']
###

Here we see the problem of grabbing http://python.org twice --- we'd like
to avoid looking at tag attributes.  To solve this, we can use a parser
that only pays attention to the non-tag data, and run our url-matching
regex on that:

###
>>> import HTMLParser
>>> class Parser(HTMLParser.HTMLParser):
...     def __init__(self):
...         HTMLParser.HTMLParser.__init__(self)
...         self.urls = []
...     def handle_data(self, data):
...         self.urls.extend(regex.findall(data))
...     def getUrls(self):
...         return self.urls
...
>>> p = Parser()
>>> p.feed(text)
>>> p.close()
>>> p.getUrls()
['http://python.org']
###
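
For anyone on Python 3, where the module lives at html.parser instead, roughly the same idea might look like the sketch below. The class name UrlCollector and the looser URL pattern are my own assumptions for illustration, not part of the session above:

```python
import re
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class

# Hypothetical, deliberately loose URL pattern (an assumption, not from
# the session above): http or https, up to whitespace, quote, or bracket.
URL_RE = re.compile(r'https?://[^\s<>"]+')


class UrlCollector(HTMLParser):
    """Collect URLs found in text data, ignoring tag attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_data(self, data):
        # Called only for text between tags, never for attribute values.
        self.urls.extend(URL_RE.findall(data))


text = """
The python.org web site,
<a href="http://python.org">http://python.org</a>
is a great resource"""

collector = UrlCollector()
collector.feed(text)
collector.close()
print(collector.urls)
```

The href value never reaches handle_data(), so only the URL in the link text is collected.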

Hope this helps!



