Using Python 2.1 to download asp www pages

Hans Nowak wurmy at earthlink.net
Sun Jan 6 16:40:50 EST 2002


Zugz wrote:
> 
> Hi,
> 
> I've recently written some Python code to extract some details about posting
> frequency etc from a board I use regularly.
> 
> I used IE5.5's Save As to give me some pages to work on offline.
> 
> I would now like to automate the whole process by downloading all the
> relevant pages or maybe even just accessing them direct.
> 
> If I use urlopen on a regular .htm page, in this case from the collection of
> links I call my www site, then things work as you would expect. You get the
> html source:
> 
> >>> a=urllib.urlopen("http://www.zugz.btinternet.co.uk/NonSFBooksBookshops.htm")
> >>> print a.read()
> 
> as you would hope.
> 
> However, if I access one of the pages of interest, which all have the same
> form as below but with a varying page number at the end:
> 
> >>> a=urllib.urlopen("http://boards.gamers.com/messages/overview.asp?name=panther_xl&page=2")
> >>> print a.read()
> 
> Then you do not get the page source but some HTML about the page being
> moved.

This is what I get in an interactive session:

>>> import urllib
>>> u = urllib.urlopen("http://boards.gamers.com/messages/overview.asp?name=panther_xl&page=2")
>>> u.url
'http://boards.gamers.com/user/profiling/login/cookieread.asp?action=read&dest_url=%2Fmessages%2Foverview%2Easp%3Fname%3Dpanther%5Fxl%26page%3D2'

If u's .url attribute is different from the URL you told it to
connect to, then this usually means a redirection. In this
case, it seems that you have to log in, and/or that a cookie
is expected. urllib.urlopen is not a mini-browser, and thus
does not implicitly use cookies.
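
The redirect check above can be sketched as a small helper. This is
only an illustration, written against the Python 3 standard library
(urllib.request and urllib.parse) rather than the Python 2.1 urllib
used in this thread. The function names explain_redirect and fetch
are made up for the example. The helper compares the URL you asked
for with the URL you ended up at, and if they differ it decodes the
query parameters of the final URL so you can see where the site
wanted to send you back to.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import urlopen


def explain_redirect(requested_url, final_url):
    """Return None if no redirect happened, else the decoded query
    parameters of the URL we were redirected to (hypothetical helper)."""
    if final_url == requested_url:
        return None
    params = parse_qs(urlsplit(final_url).query)
    # parse_qs gives a list per name; keep the first value of each.
    return {name: values[0] for name, values in params.items()}


def fetch(url):
    """Fetch url and return (final_url, body). Needs network access."""
    with urlopen(url) as response:
        return response.geturl(), response.read()
```

Applied to the two URLs from the session above, explain_redirect
should report action as 'read' and dest_url as the original
/messages/overview.asp path with its query string, which suggests the
cookie-reading page intends to send you back there afterwards.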

You may want to look into the Cookie.py module (in the standard
library). Not sure if that will help you here though, because
I've never used it myself. :-/
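
For what it is worth, here is a rough sketch of a cookie-aware fetch.
Note that it does not use Cookie.py: it uses http.cookiejar and
urllib.request from the Python 3 standard library, which are not
available in Python 2.1, so treat it as a substitute technique rather
than a recipe for that version. The names make_cookie_opener and
fetch_with_cookies are invented for the example, and whether this
particular board is satisfied by cookies alone or also needs a real
login is something I cannot tell from here.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_cookie_opener():
    """Build an opener that stores cookies the server sets and sends
    them back on later requests, including redirected ones."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def fetch_with_cookies(url):
    """Fetch url with cookie support. Needs network access.

    Returns (final_url, body, names of the cookies received)."""
    opener, jar = make_cookie_opener()
    with opener.open(url) as response:
        body = response.read()
        return response.geturl(), body, [cookie.name for cookie in jar]
```

If the final URL returned by fetch_with_cookies is still the
cookieread.asp page, then cookies alone are probably not enough and
the site expects an actual login first.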

--Hans (base64.decodestring('d3VybXlAZWFydGhsaW5rLm5ldA=='))
       # decode for email address ;-)
Site:: http://www.awaretek.com/nowak/


