[Tutor] Re: Regex [example:HTMLParser, unittest, StringIO]

Mon Aug 25 15:11:02 EDT 2003

On Mon, 25 Aug 2003, Andrei wrote:

> Perhaps I should have explained my goal more clearly: I wish to take a
> piece of text which may or may not contain HTML tags and turn any piece
> of text which is NOT a link, but is an URL into a link. E.g.:
>
>    go to <a href="http://home.com">http://home.com</a>. [1]
>    go <a href="http://home.com">home</a>. [2]
>
> should remain unmodified, but
>
>    go to http://home.com [3]
>
> should be turned into [1]. That negative lookbehind can do the job in
> the large majority of the cases (by not matching URLs if they're
> preceded by single or double quotes or by ">"), but not always since it
> doesn't allow the lookbehind to be non-fixed length. I think one of the
> parser modules might be able to help (?) but regardless of how much I
> try, I can't get the hang of them, while I do somewhat understand
> regexes.

Hi Andrei,

Hmm... The example in:

    http://mail.python.org/pipermail/tutor/2003-August/024902.html

should be really close to what you're looking for.

Here's another example that shows how to use the handle_starttag() and
handle_endtag() methods.  The example also shows how we can use "unit
tests" to make sure our class is doing the right thing.

###
import HTMLParser

class Parser(HTMLParser.HTMLParser):
    """A small example for HTMLParser that pays attention to anchored
       text."""
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            print "Anchored text:", data

# Here is a small unit test to see if it's actually working.
import sys
import unittest
from cStringIO import StringIO

class TestParser(unittest.TestCase):
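    def setUp(self):
        # Capture everything the parser prints so the test can inspect it.
        self.buffer = StringIO()
        self.saved_stdout = sys.stdout
        sys.stdout = self.buffer

    def tearDown(self):
        # Put the real stdout back so later output isn't swallowed.
        sys.stdout = self.saved_stdout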

    def testParsing(self):
        text = """<html><body>
                  This is a <a>test</a>
                  What is <a>thy bidding,</a> my master?"""
        parser = Parser()
        parser.feed(text)
        self.assertEqual('Anchored text: test\n'
                         + 'Anchored text: thy bidding,\n',
                         self.buffer.getvalue())

if __name__ == '__main__':
    unittest.main()
###

Warning: the code above is not really that well designed.  *grin*

We can see this already because the testing of the class is awkward: I'm
forced to do some calisthenics (redirecting standard output in order to
reliably test the parser class), and that's a little ugly.

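It would be much better if we redesigned the Parser to make it easier to
test, say by having it collect the anchored text and hand it back through
a method instead of printing it.  The test might then look something like: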

###
class TestParser(unittest.TestCase):
    def setUp(self):
        self.parser = Parser()

    def testParsing(self):
        text = """<html><body>
                  This is a <a>test</a>
                  What is <a>thy bidding,</a> my master?"""
        self.parser.feed(text)
        self.assertEqual('Anchored text: test\n'
                         + 'Anchored text: thy bidding,\n',
                         self.parser.getAnchoredText())
###
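
For what it's worth, here is a minimal sketch of what such a redesigned
Parser might look like.  The getAnchoredText() name comes from the test
above; the "anchored" list, and the try/except import that lets the sketch
run under either the HTMLParser or the html.parser module name, are my own
assumptions for illustration and not the only way to do it.

```python
try:
    from HTMLParser import HTMLParser      # older module name
except ImportError:
    from html.parser import HTMLParser     # newer module name


class Parser(HTMLParser):
    """A sketch of a Parser that collects anchored text instead of
       printing it."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_anchor = False
        # Assumption: one list entry per chunk of text that handle_data()
        # receives while we are inside <a>...</a>.
        self.anchored = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.anchored.append(data)

    def getAnchoredText(self):
        # Build the same lines the print-based version wrote to stdout.
        return ''.join(['Anchored text: %s\n' % chunk
                        for chunk in self.anchored])


if __name__ == '__main__':
    parser = Parser()
    parser.feed('This is a <a>test</a>')
    print(parser.getAnchoredText())
```

Because this version returns a string rather than printing, the test no
longer has to touch sys.stdout at all.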

Anyway, please feel free to ask more questions about HTMLParser.  Hope
this helps!