[Tutor] Visiting a URL

Sun Nov 3 20:30:02 2002

On Sat, 2 Nov 2002, John Abbe wrote:

> At 12:29 AM -0800 on 2002-11-01, Danny Yoo typed:
> >On Thu, 31 Oct 2002, John Abbe wrote:
> >>  I've got a newbie question -- i'm looking at altering PikiePikie to
> >>  notify weblogs.com when my weblog updates. I could get all involved in
> >>  XML-RPC, but it's doable through a plain URL. How do i visit a URL in
> >>  Python?
> >
> >Hi John,
> >
> >Do you mean retrieving the contents of the URL resource?  If so, there's a
> >nice module called 'urllib' that allows us to open URLs as if they were
> >files.
> >
> >     http://python.org/doc/lib/module-urllib.html
>
> Very cool. Thanks! Even that may be a little overkill. All i need to do
> is visit the URL; i don't need the response.

By "response", I'll assume you mean that you don't need to look at the
body of the web response; all we may want to look at are the status and
the headers of the response.

For that, we can use the 'httplib' http-client module:

    http://www.python.org/doc/lib/module-httplib.html

For example:

###
>>> import httplib
>>> connection = httplib.HTTPConnection('python.org')
>>> connection.request('GET', 'index.html')
>>> response = connection.getresponse()
>>> response.status
400
>>> response.reason
'Bad Request'
>>>
>>>                   ## Oops, forgot to put a '/' in front!
>>>                   ## (actually, most web browsers will do this
>>>                   ## correction for us!)
>>>
>>> connection.request('GET', '/index.html')
>>> response = connection.getresponse()
>>> response.status
200
>>> response.reason
'OK'
>>> response.getheader('last-modified')
'Tue, 29 Oct 2002 22:51:06 GMT'
###

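So using httplib would be one way of visiting that URL without actually
reading the whole page.  (Strictly speaking, a 'GET' still asks the server
to send the page body; we just never read it.  Passing 'HEAD' instead of
'GET' asks for the headers alone.)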
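
If we really want the headers and nothing else, HTTP has a 'HEAD' method
for exactly that.  Here is a minimal sketch of the idea, assuming Python 3,
where httplib goes by the name http.client.  The names normalize_path and
head_headers are ones I made up for illustration, and the ten-second
timeout is an arbitrary choice:

```python
# Python 3 sketch: the httplib module is called http.client there.
import http.client


def normalize_path(path):
    """Return path with a leading '/', which an HTTP request line needs."""
    if not path.startswith('/'):
        return '/' + path
    return path


def head_headers(host, path='/', timeout=10):
    """Send a HEAD request; return (status, reason, headers as a dict).

    HEAD asks the server for the status line and headers only, so no
    page body is transferred.
    """
    connection = http.client.HTTPConnection(host, timeout=timeout)
    try:
        connection.request('HEAD', normalize_path(path))
        response = connection.getresponse()
        return response.status, response.reason, dict(response.getheaders())
    finally:
        # Always release the socket, even if the request failed.
        connection.close()
```

Calling head_headers('python.org', 'index.html') would then ask for
'/index.html', since normalize_path() adds the slash I forgot in the
session above.  What status and headers come back depends on the server,
so I won't guess at the output here.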

We can wrap this all up into a tidy function:

###
>>> import httplib
>>> import urlparse
>>> def getLastModified(url):
...     scheme, location, path, params, query, fragment = \
...         urlparse.urlparse(url)
...     connection = httplib.HTTPConnection(location)
...     connection.request('GET', path)
...     return connection.getresponse().getheader('last-modified')
...
>>> getLastModified('http://python.org/')
'Tue, 29 Oct 2002 22:51:06 GMT'
###

The function above is a bit sloppy: it only handles standard http
connections, it ignores the query part of the URL, and it isn't doing much
error checking at that.  But it's something we can build on.
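
As one possible way of building on it, here is a rough sketch that also
accepts https URLs, keeps the query string, and returns None instead of
blowing up when something goes wrong.  It assumes Python 3 (http.client
and urllib.parse are the newer names for httplib and urlparse), and I've
swapped in 'HEAD' for 'GET' so that only headers are requested.  The names
CONNECTION_CLASSES, split_url and get_last_modified are made up for this
example, and treating every failure as None is just one design choice:

```python
import http.client
import urllib.parse

# Hypothetical lookup table: URL scheme -> connection class to use.
CONNECTION_CLASSES = {
    'http': http.client.HTTPConnection,
    'https': http.client.HTTPSConnection,
}


def split_url(url):
    """Return (scheme, host, request_target), or raise ValueError."""
    parts = urllib.parse.urlsplit(url)
    if parts.scheme not in CONNECTION_CLASSES:
        raise ValueError('unsupported scheme: %r' % parts.scheme)
    if not parts.netloc:
        raise ValueError('no host in URL: %r' % url)
    # An empty path means the root of the site.
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    return parts.scheme, parts.netloc, target


def get_last_modified(url, timeout=10):
    """Return the Last-Modified header of url, or None if unavailable."""
    scheme, host, target = split_url(url)
    connection = CONNECTION_CLASSES[scheme](host, timeout=timeout)
    try:
        connection.request('HEAD', target)
        response = connection.getresponse()
        if response.status != 200:
            return None
        return response.getheader('last-modified')
    except (OSError, http.client.HTTPException):
        # Network trouble or a malformed reply: report "don't know".
        return None
    finally:
        connection.close()
```

With this, split_url('http://python.org') gives ('http', 'python.org', '/'),
and a URL with an unsupported scheme such as 'ftp://example.com/' raises
ValueError before any connection is attempted.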

Hope this helps!