urllib2.urlopen(url) pulling something other than HTML

Mon Aug 20 14:44:12 EDT 2007

I am reading "Python for Dummies" and found the following example of a
web crawler that I thought was interesting.  The first time I keyed in
the program and ran it, I didn't understand it well enough to debug
it, so I just skipped it.  A few days later I realized that it failed
after a few seconds, and I wanted to know whether that was a
shortcoming of Python, a typo on my part or an inherent problem with
the script, so I retyped it and started trying to figure out what went
wrong.

Please keep in mind I am very new to coding, so I have tried to RTFM
without much success.  I have a basic understanding of what the
application is doing, but I want to understand WHY it is doing it, or
what the rationale is, not necessarily how it does it.  In any case
here is the gist of the app.

1 - a new spider is created
2 - it takes a single argument which is a web address
(http://www.google.com)
3 - the spider pulls a copy of the page source
4 - the spider parses it for links and if the link is on the same
domain and has not already been parsed then it appends the link to the
list of pages to be parsed

Being new I have a couple of questions that I am hoping someone can
answer with some degree of detail.

----------------------------------------------------------
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist
----------------------------------------------------------

I get the idea that we're allocating some memory that looks like a
file so formatter.DumbWriter can manipulate it.  The results are
passed to formatter.AbstractFormatter, which does something else to
the HTML code.  The result is then assigned to "f", which is passed to
htmllib.HTMLParser so it can parse the HTML for links.  I guess I
don't understand in any detail why this is happening.  I know someone
is going to say that I should RTFM, so here is the gist of the
documentation:

formatter.DumbWriter = "This class is suitable for reflowing a
sequence of paragraphs."
formatter.AbstractFormatter = "The standard formatter. This
implementation has demonstrated wide applicability to many writers,
and may be used directly in most circumstances. It has been used to
implement a full-featured World Wide Web browser." <-- huh?

So, what are DumbWriter and AbstractFormatter doing with this HTML, and
why does it need to be done before parser.feed() gets hold of it?
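
For comparison, here is a minimal sketch of the same link extraction
with no formatter involved at all.  It assumes Python 3's
html.parser.HTMLParser rather than the htmllib.HTMLParser used in the
book, and the class name AnchorCollector is made up for this
illustration.  In htmllib the data flows parser -> formatter -> writer:
the formatter and writer only receive the text the parser extracts from
the page, which find_links never looks at, so nothing is done to the
HTML before parser.feed() sees it.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical stand-in for htmllib.HTMLParser: records <a href> values."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []  # same attribute name that htmllib uses

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def find_links(html):
    """Return the href of every anchor tag found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchorlist


sample = '<p><a href="/jobs">Jobs</a> and <A HREF="about.html">About</A></p>'
print(find_links(sample))
```

The only output this version keeps is the list of links, which is also
the only thing the book's find_links returns.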

The next question is: I can't find any documentation that explains
where the "anchorlist" attribute comes from.  Here is the only
reference to this attribute that I can find anywhere in the Python
documentation.

----------------------
anchor_bgn(href, name, type)
    This method is called at the start of an anchor region. The
arguments correspond to the attributes of the <A> tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the HREF attribute for <A> tags) within the document. The
list of hyperlinks is available as the data attribute anchorlist.
----------------------

So, how does an average developer figure out that the parser exposes a
list of hyperlinks in an attribute called anchorlist?  Is this
something that you just "figure out", or is there some book I should
be reading that documents all of the attributes of a particular
class?  It just seems a bit obscure and certainly not something I
would have figured out on my own.  Does this make me a poor developer
who should find another hobby?  I just need to know if there is
something wrong with me or if this is a reasonable question to ask.
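
One thing that might help with this kind of discovery is asking the
object itself what it carries.  A small sketch, assuming nothing beyond
the built-ins dir() and vars(); the class LinkParser and the helper
public_names are made-up stand-ins for illustration, not htmllib's real
parser:

```python
class LinkParser(object):
    """Hypothetical stand-in for a parser that keeps a list of hyperlinks."""

    def __init__(self):
        self.anchorlist = []  # data attribute, set when the object is created

    def feed(self, data):
        """Pretend to parse data (does nothing in this sketch)."""


def public_names(obj):
    """Return the names on obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


parser = LinkParser()
print(public_names(parser))  # methods and data attributes together
print(vars(parser))          # only the data stored on this instance
```

In the interactive interpreter, help() on the class prints the
docstrings for the same names, which is often quicker than searching
the manual.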

The last question I have is about debugging.  The spider is capable
of parsing links until it reaches:

"html = get_page(http://www.google.com/jobs/fortune)", which returns
the contents of a PDF document.  The PDF contents are assigned to
html, which is later passed to parser.feed(html), which crashes.

I'm smart enough to know that whenever you take in some input that you
should do some basic type checking to make sure that whatever you are
trying to manipulate (especially if it originates from outside of your
application) won't cause your application to crash.  If you're
expecting an ASCII character then make sure you're not getting an
object or string of text.

How would an experienced Python developer check the contents of "html"
to make sure it's not something other than a blob of HTML code?  I
should note an obvious catch-22: how do I check the HTML in such a way
that the check itself can't crash the app?  I thought about:

try:
    parser.feed(html)
except parser.HTMLParseError:
    parser.close()

.... but I'm not sure if that is right or not.  The app still crashes,
so obviously I'm doing something wrong.
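
A sketch of one possible guard: decide from the Content-Type header and
the first few bytes of the body whether a response is worth parsing.
The function name looks_like_html and the particular checks are
assumptions for illustration, not part of the book's program.  With
urllib2 the header should be available from
page.info().getheader("Content-Type") while the page is still open.
Separately, HTMLParseError is defined in the htmllib module rather than
on the parser object, so the except clause would probably need to name
htmllib.HTMLParseError.

```python
def looks_like_html(content_type, body):
    """Guess whether a response is HTML from its header and leading bytes."""
    # A header such as "text/html; charset=ISO-8859-1" carries parameters
    # after the semicolon; only the media type before it matters here.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type not in ("text/html", "application/xhtml+xml"):
        return False
    # PDF files start with the marker "%PDF", whatever the server claims.
    if body[:4] in ("%PDF", b"%PDF"):
        return False
    return True


print(looks_like_html("text/html; charset=ISO-8859-1", "<html></html>"))
print(looks_like_html("application/pdf", "%PDF-1.4 ..."))
print(looks_like_html(None, ""))
```

This check only does string comparisons, so it should not be able to
crash on odd input the way the parser can.  A response with no
Content-Type header is treated as not HTML, which is a conservative
choice that skips some pages.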

Here is the full app for your review.

Thank you for any help you can provide!  I greatly appreciate it!

#!/usr/bin/python

#these modules do most of the work
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO

def log_stdout(msg):
    """Print msg to the screen."""
    print msg

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """return a list of links in HTML"""
    #We're using the parser just to get the hrefs
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist

class Spider:
    """
    The heart of this program, finds all links within a web site.

    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """

    def __init__(self, startURL, log=None):
        #this method sets initial values
        self.URLs = set() #create a set
        self.URLs.add(startURL) #add the start url to the set
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            #use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log

    def run(self):
        #process list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)

    def url_in_site(self, link):
        #checks whether the link starts with the base URL
        return link.startswith(self.include)

    def process_page(self, url):
        #retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            #handle relative links
            link = urlparse.urljoin(url,link)
            self.log("Checking: " + link)
            #make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)

if __name__ == '__main__':
    #this code runs when script is started from command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL