How to use urllib2.BaseHandler class

John J. Lee jjl at pobox.com
Sun Jul 11 19:07:26 EDT 2004


writeson at charter.net (Doug Farrell) writes:

> Hi all,
> 
> I'm trying to build a web page crawler to help us build our websites,
> which are driven by static pages after they are called the first time.
> Anyway, I can use urllib2.urlopen() no problem, but I'd like to have
> more control over the process. In particular I'd like to get back the
> HTTP status code from the request, even if it's a 200. It looks like I
> can do that by deriving my own class from HTTPHandler, but I'm not
> sure how to go about it. Can anyone direct me to some useful example
> code for this kind of thing?

In 2.3, urllib2 only ever *returns* a response if the code is 200.  In
other cases, HTTPError exceptions are *raised*.  HTTPError instances
satisfy the normal response interface, so you can catch them and use
them just as you would the return value of urlopen().  As you've
noticed, they also have .code and .msg attributes (unlike normal
response objects in 2.3 -- since a returned response always has code
200 there, those attributes weren't really necessary!).
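
For example, something along these lines (a rough 2.3-style sketch --
the URL is just a placeholder):

import urllib2

try:
    response = urllib2.urlopen("http://www.python.org/some-page")
except urllib2.HTTPError, err:
    # HTTPError supports the usual response interface, plus .code and .msg
    print err.code, err.msg
    data = err.read()
else:
    # in 2.3, reaching here means the status code was 200
    data = response.read()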

Now for 2.4, where things have changed a bit.

I *think* the 2.4 CVS urllib2.py will work fine with Python 2.3 (the
annoying Python test suite runner makes it a mild pain to check).

As I mentioned in another thread, don't use the urllib2 from 2.4a1 --
it's broken.

In 2.4, some successful responses other than 200 are also returned (at
present, only 200 and 206).  Also, all response objects have .code and
.msg attributes -- not only HTTPError, but those that get returned,
too (i.e. 200 and 206 ATM).  If you want all responses returned rather
than raised as exceptions, or vice versa, it's much easier to achieve
that in 2.4 than in 2.3.  It's easier because the interface of handler
objects has been extended to allow pre- and post-processing of requests
and responses respectively, and that feature is now used by urllib2 to
implement HTTP error handling separately from the rest of HTTP
fetching.  Snip from CVS urllib2.py:

class HTTPErrorProcessor(BaseHandler):
    """Process HTTP error responses."""
    handler_order = 1000  # after all other processing

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        if code not in (200, 206):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response

    https_response = http_response


So, to get all responses returned without error handling, regardless
of status code (this will disable things like authentication and
redirection, of course, so you might want to be a bit more
restrictive and still pass selected error codes on to
self.parent.error() -- there's a sketch of that further down):

import urllib2

class NullHTTPErrorProcessor(urllib2.HTTPErrorProcessor):
    def http_response(self, request, response):
        return response

    https_response = http_response

opener = urllib2.build_opener(NullHTTPErrorProcessor())
opener.open("http://www.python.org/")  # never raises HTTPError


You should probably only do this if you have good reason, because you
may confuse people reading your code.
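
If you do want to keep redirection, authentication and friends working
while still getting, say, 404s back as ordinary responses, a slightly
more selective processor does the trick.  Rough sketch (the class name
and the choice of codes are just for illustration):

import urllib2

class SelectiveHTTPErrorProcessor(urllib2.HTTPErrorProcessor):
    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()
        # hand everything except 200, 206 and 404 back to the normal
        # error machinery, so redirects and auth still happen
        if code not in (200, 206, 404):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)
        return response

    https_response = http_response

opener = urllib2.build_opener(SelectiveHTTPErrorProcessor())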

Use urllib2.install_opener() if you want to use urllib2.urlopen().
Usually there's no real point, though.
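
That just means something like this (reusing the NullHTTPErrorProcessor
class from above):

import urllib2

opener = urllib2.build_opener(NullHTTPErrorProcessor())
urllib2.install_opener(opener)
# from here on, plain urllib2.urlopen() goes through the custom opener
response = urllib2.urlopen("http://www.python.org/")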

If you want to stick with pre-2.4 code, look at ClientCookie for
example code.  That code is full of cruft though, since it's supposed
to work back to 1.5.2, and has to cut and paste a fair amount as a
result ;-)

HTH


John


