[Tutor] Re: Regex [example:HTMLParser, unittest, StringIO]
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Mon Aug 25 15:11:02 EDT 2003
On Mon, 25 Aug 2003, Andrei wrote:
> Perhaps I should have explained my goal more clearly: I wish to take a
> piece of text which may or may not contain HTML tags and turn any piece
> of text which is NOT a link, but is an URL into a link. E.g.:
>
> go to <a href="http://home.com">http://home.com</a>. [1]
> go <a href="http://home.com">home</a>. [2]
>
> should remain unmodified, but
>
> go to http://home.com [3]
>
> should be turned into [1]. That negative lookbehind can do the job in
> the large majority of the cases (by not matching URLs if they're
> preceded by single or double quotes or by ">"), but not always since it
> doesn't allow the lookbehind to be non-fixed length. I think one of the
> parser modules might be able to help (?) but regardless of how much I
> try, I can't get the hang of them, while I do somewhat understand
> regexes.
Hi Andrei,
Hmm... The example in:
http://mail.python.org/pipermail/tutor/2003-August/024902.html
should be really close to what you're looking for.
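To make the parser idea concrete for your three cases, here is a rough sketch. It is not the code from that message, just one way it could go. The names Linkifier, linkify and URL_RE are made up for illustration, and the URL pattern is deliberately crude: it only knows about http:// and will swallow trailing punctuation such as a final period. The try/except import is only there so the sketch also runs on Pythons where the module is called html.parser.

```python
import re

try:
    from html.parser import HTMLParser      # newer Pythons
except ImportError:
    from HTMLParser import HTMLParser       # Python 2 module name

# Hypothetical, deliberately crude URL pattern (an assumption, not a standard).
URL_RE = re.compile(r'http://[^\s<>"]+')


class Linkifier(HTMLParser):
    """Rebuilds the text, wrapping bare URLs that are not inside <a>...</a>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_anchor = False
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True
        # Emit the start tag exactly as it appeared in the input.
        self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False
        self.pieces.append('</%s>' % tag)

    def handle_data(self, data):
        # Only rewrite URLs when we are outside of an anchor.
        if not self.in_anchor:
            data = URL_RE.sub(r'<a href="\g<0>">\g<0></a>', data)
        self.pieces.append(data)

    def getvalue(self):
        return ''.join(self.pieces)


def linkify(text):
    parser = Linkifier()
    parser.feed(text)
    parser.close()      # flush any text the parser is still holding on to
    return parser.getvalue()
```

With this, linkify('go to http://home.com') wraps the URL in an <a> element, while your examples [1] and [2] pass through unchanged. Be warned that the sketch is lossy: comments, self-closing tags and character references like &amp; are not reproduced faithfully, so it would need more handler methods before being used on real pages.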
Here's another example that shows how to use the handle_starttag() and
handle_endtag() methods. The example also shows how we can use "unit
tests" to make sure our class is doing the right thing.
###
import HTMLParser

class Parser(HTMLParser.HTMLParser):
    """A small example for HTMLParser that pays attention to anchored
    text."""
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        # Only report text that sits between <a> and </a>.
        if self.in_anchor:
            print "Anchored text:", data
"""Here is a small unit test to see if it's actually working."""
import sys
import unittest
from cStringIO import StringIO

class TestParser(unittest.TestCase):
    def setUp(self):
        # Capture everything the parser prints.
        self.buffer = StringIO()
        sys.stdout = self.buffer

    def tearDown(self):
        # Put the real stdout back once the test is done.
        sys.stdout = sys.__stdout__

    def testParsing(self):
        text = """<html><body>
This is a <a>test</a>
What is <a>thy bidding,</a> my master?"""
        parser = Parser()
        parser.feed(text)
        self.assertEqual('Anchored text: test\n'
                         + 'Anchored text: thy bidding,\n',
                         self.buffer.getvalue())

if __name__ == '__main__':
    unittest.main()
###
Warning: the code above is not really that well designed. *grin*
We can see this already because the testing of the class is awkward: I'm
forced to do some calisthenics --- I'm redirecting standard output in order
to reliably test the parser class --- and that's a little ugly.
It would be much better if we redesign the Parser to make it easier to
test. Perhaps something like:
###
class TestParser(unittest.TestCase):
    def setUp(self):
        self.parser = Parser()

    def testParsing(self):
        text = """<html><body>
This is a <a>test</a>
What is <a>thy bidding,</a> my master?"""
        # The parser lives on the test instance, so go through self.
        self.parser.feed(text)
        self.assertEqual('Anchored text: test\n'
                         + 'Anchored text: thy bidding,\n',
                         self.parser.getAnchoredText())
###
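One way to get there, as a rough sketch: have handle_data() collect what it sees in a list instead of printing it, and add the getAnchoredText() method that the test above calls. The attribute name "anchored" is made up for illustration, and the try/except import is only there so the sketch also runs on Pythons where the module is called html.parser.

```python
try:
    from html.parser import HTMLParser      # newer Pythons
except ImportError:
    from HTMLParser import HTMLParser       # Python 2 module name


class Parser(HTMLParser):
    """Collects anchored text instead of printing it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_anchor = False
        self.anchored = []   # hypothetical name: one entry per anchored chunk

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            # Same format the print statement produced, but kept in memory.
            self.anchored.append('Anchored text: %s\n' % data)

    def getAnchoredText(self):
        return ''.join(self.anchored)
```

Now the test no longer has to touch sys.stdout at all: it feeds the text in and compares against whatever getAnchoredText() returns.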
Anyway, please feel free to ask more questions about HTMLParser. Hope
this helps!