Retrieving Info From Web W/ Python

Noah noah at noah.org
Fri May 9 16:23:29 EDT 2003


"Laurence Spector" <laurence at trdlnk.com> wrote in message news:<mailman.1052486416.17716.python-list at python.org>...
> I am new to Python, and I noticed there is a geturl() function that gets
> the contents of a web address. I am trying to get Python to do this,
> except first it has to input a username, password, and then press
> log-in. Then it needs to click another link. And finally, print the web
> page. How do I get Python to "click" links and input information on the
> web in order to get to dynamically generated web pages?
>  
> I'd appreciate if anyone has any ideas on how to make such "web macros"
> with Python. I assume it uses the CGI module, but the instructions only
> seem to indicate how to take data from web pages and create new web
> pages that incorporate it. Thanks,
>  
> Laurence

This can be complicated. Sorry, this will probably not be encouraging,
because it sounds like you are also new to HTTP.

You have to look at the HTML source of the site you are trying to
interface with. The ACTION attribute of the form tag will tell you
where the form is submitted.
Clicking submit is like going to the URL of the form action.
Then you have to encode your form input variables in order
to submit them. How you do this depends on whether the
form action calls a GET or a POST CGI (or both). For this
you need something lower level than urllib.urlopen(). You need
to be able to edit your headers before you call the remote CGI.
This generic example shows how you might POST a form to a CGI.
As you can see, there is some learning ahead of you:
    import httplib
    import urllib

    ex_form_var_a = "Noah"
    ex_form_var_b = "1"
    http = httplib.HTTPConnection('www.example.com')
    # Encode the form variables the same way a browser would.
    params = urllib.urlencode({'foo': ex_form_var_a, 'bar': ex_form_var_b})
    http.putrequest('POST', '/cgi/some_cgi')
    # A POST needs these two headers so the CGI can read the body.
    http.putheader("Content-type", "application/x-www-form-urlencoded")
    http.putheader("Content-length", "%d" % len(params))
    http.endheaders()
    http.send(params)
    response = http.getresponse()
    print response.read()
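
For a GET form the encoded variables go in the query string of the
URL instead of the request body. Something like this sketch should
work (the host name and CGI path here are made up):
    import httplib
    import urllib

    params = urllib.urlencode({'foo': 'Noah', 'bar': '1'})
    http = httplib.HTTPConnection('www.example.com')
    # For GET, tack the encoded variables onto the URL after a '?'.
    http.putrequest('GET', '/cgi/some_cgi?' + params)
    http.endheaders()
    response = http.getresponse()
    print response.read()
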
I won't even get into cookies unless you are really interested!
But it's basically the same type of header manipulation before
you send a request to a web site, plus some response header
parsing to pull out the cookies.
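
Here is a rough, untested sketch of that parsing. Assume "response"
is an httplib response object from a request like the POST example
above:
    # The server hands you a cookie in the Set-Cookie response header.
    cookie = response.getheader('set-cookie')  # None if no cookie was set
    if cookie:
        # Keep just the "name=value" part; a real client would also
        # honor the path, domain, and expires attributes.
        cookie = cookie.split(';')[0]
    # On your next request, hand the cookie back before endheaders():
    #     http.putheader('Cookie', cookie)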

Are you connecting with Secure HTTP (https://)?
If so then you will have to make sure that your version
of Python supports SSL (only the newer versions do) or
you will have to try M2Crypto.
See if your version of Python supports HTTPS. Start up
your Python interpreter and type this:
    >>> import socket
    >>> hasattr(socket, "ssl")
    1
If you get a 1 then you are good to go for HTTPS.
If you get a 0 then you are screwed.
Most UNIX platforms will have SSL. On Windows the ActiveState version
does not have SSL, but the version that comes with Cygwin does have it.
Then see if you can connect to an https site:
    import httplib
    HOSTNAME = 'login.yahoo.com'
    conn = httplib.HTTPSConnection(HOSTNAME)
    conn.putrequest('GET', '/')
    conn.endheaders()
    response = conn.getresponse()
    print response.read()

Can you determine what type of authentication the target web site is doing?
Is it doing "basic authentication" (basic auth)?
Or it might just be accepting a user name and password through a standard
HTML form.
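
If it is basic auth, the client just base64 encodes "user:password"
and sends it in an Authorization header. A sketch (the host, path,
and credentials here are all made up):
    import base64
    import httplib

    # Basic auth is just "user:password" run through base64.
    auth = base64.encodestring('myname:mypassword').strip()
    http = httplib.HTTPConnection('www.example.com')
    http.putrequest('GET', '/private/page.html')
    http.putheader('Authorization', 'Basic ' + auth)
    http.endheaders()
    response = http.getresponse()
    # 200 means the credentials worked; 401 means they did not.
    print response.status, response.reason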

Does the target site use some sort of session management
to keep track of your credentials after you login?
For example, cookies or mangled URLs.

It's unfortunate that it is this complicated, but that is the
nature of authentication. I have considered writing a framework
to make this easier (logging in and managing cookies), but even
after all that you still have to hack the target web site's HTML.
With all the complicated HTML, Javascript, frames, and redirects
that modern web sites use it can be very challenging to dissect
the minimum that a site needs to authenticate you.
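
To tie the pieces together, here is a rough, untested sketch of the
kind of "web macro" you described: POST the login form, pull out the
session cookie, then "click" a link by requesting it with the cookie
attached. Every host name, path, and form variable here is made up;
you have to dig the real ones out of the site's HTML:
    import httplib
    import urllib

    HOST = 'www.example.com'

    # Step 1: submit the login form.
    params = urllib.urlencode({'username': 'laurence', 'password': 'secret'})
    http = httplib.HTTPConnection(HOST)
    http.putrequest('POST', '/cgi/login')
    http.putheader('Content-type', 'application/x-www-form-urlencoded')
    http.putheader('Content-length', '%d' % len(params))
    http.endheaders()
    http.send(params)
    response = http.getresponse()
    cookie = response.getheader('set-cookie')
    response.read()

    # Step 2: "click" the link, sending the session cookie back.
    http = httplib.HTTPConnection(HOST)
    http.putrequest('GET', '/members/report.html')
    if cookie:
        http.putheader('Cookie', cookie.split(';')[0])
    http.endheaders()
    response = http.getresponse()

    # Step 3: print the dynamically generated page.
    print response.read()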

Yours,
Noah



