Retrieving Info From Web W/ Python
Anand Pillai
pythonguy at Hotpop.com
Sat May 10 10:50:47 EDT 2003
And if you are behind a proxy/firewall then you cannot even
go for httplib, but need to pull in the support of the urllib/
urllib2 modules.
There is a module available that does cookie management
transparently on top of the urllib2 module. The module is called
"ClientCookie". You can browse the code at this site:
http://wwwsearch.sourceforge.net/ClientCookie/
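For readers on Python 3, here is a minimal sketch of the same idea using only the standard library, where urllib.request and http.cookiejar play the roles that urllib2 and a cookie layer play above. The helper name make_opener and the proxy address are made up for illustration, and nothing is sent over the network here.

```python
import http.cookiejar
import urllib.request


def make_opener(proxy_url=None):
    """Return (opener, jar). proxy_url is a hypothetical proxy address."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor stores cookies from responses in the jar and
    # sends them back on later requests made through the same opener.
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxy_url is not None:
        # Route both http and https requests through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers), jar


# Assumed proxy address, purely for illustration.
opener, jar = make_opener("http://proxy.example.com:8080")
# opener.open("http://www.example.com/") would fetch through the proxy
# and collect any cookies the site sets into `jar`.
```

Every request made through the same opener shares the jar, so a login followed by a second page fetch keeps the session without touching the headers by hand.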
Anand Pillai
noah at noah.org (Noah) wrote in message news:<c9d82136.0305091223.444dce9 at posting.google.com>...
> "Laurence Spector" <laurence at trdlnk.com> wrote in message news:<mailman.1052486416.17716.python-list at python.org>...
> > I am new to Python, and I noticed there is a geturl() function that gets
> > the contents of a web address. I am trying to get Python to do this,
> > except first it has to input a username, password, and then press
> > log-in. Then it needs to click another link. And finally, print the web
> > page. How do I get Python to "click" links and input information on the
> > web in order to get to dynamically generated web pages?
> >
> > I'd appreciate if anyone has any ideas on how to make such "web macros"
> > with Python. I assume it uses the CGI module, but the instructions only
> > seem to indicate how to take data from web pages and create new web
> > pages that incorporate it. Thanks,
> >
> > Laurence
>
> This can be complicated. Sorry this will probably not be encouraging
> because it sounds like you are also new to HTTP.
>
> You have to look at the HTML source of the site you are trying to
> interface with. The HTML Form will tell you where a form is
> submitted in the ACTION attribute of the form tag.
> Clicking submit is like going to the URL of the form action.
> Then you have to encode your form input variables in order
> to submit them. How you do this depends on whether the
> form action calls a GET or POST CGI (or both). For this
> you need something lower level than urlget. You need to
> be able to edit your header before you call the remote CGI.
> This generic example shows how you might POST a form to a CGI.
> As you can see, you are looking at some learning:
> import httplib
> import urllib
> # Example values for the two form fields
> ex_form_var_a = "Noah"
> ex_form_var_b = "1"
> http = httplib.HTTPConnection ('www.example.com')
> # Encode the fields as name=value pairs joined by '&'
> params = urllib.urlencode ({'foo':ex_form_var_a,'bar':ex_form_var_b})
> # Send the request line and headers by hand, then the encoded body
> http.putrequest('POST','/cgi/some_cgi')
> http.putheader("Content-type", "application/x-www-form-urlencoded")
> http.putheader("Content-length", "%d" % len(params))
> http.endheaders()
> http.send(params)
> response = http.getresponse ()
> print response.status, response.reason
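The steps described above (read the form's ACTION, encode the fields, submit to that URL) can be sketched in Python 3 roughly as follows. The sample HTML, the page URL, and the FormFinder class are all invented for the example; a real page will usually need more careful handling (several forms, select and textarea fields, and so on). The request is only built, not sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormFinder(HTMLParser):
    """Record the first form's action, method, and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "GET"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "GET").upper()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep default values; hidden fields often must be sent back.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical page, standing in for the HTML source of the target site.
PAGE_URL = "http://www.example.com/login.html"
SAMPLE_HTML = """
<form action="/cgi/some_cgi" method="post">
  <input type="text" name="foo" value="">
  <input type="hidden" name="bar" value="1">
  <input type="submit" value="Log in">
</form>
"""

finder = FormFinder()
finder.feed(SAMPLE_HTML)

fields = dict(finder.fields)
fields["foo"] = "Noah"                     # fill in the visible field
body = urlencode(fields).encode("ascii")   # form-encoded POST body

# The action may be relative, so resolve it against the page URL.
request = Request(urljoin(PAGE_URL, finder.action), data=body, method=finder.method)
request.add_header("Content-Type", "application/x-www-form-urlencoded")
# urllib.request.urlopen(request) would actually submit the form.
```

This sketch only covers the POST case; for a GET form the encoded fields would go into the query string of the URL instead of the body.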
> I won't even get into cookies unless you are really interested!
> But it's basically the same type of header manipulation before
> you send a request to a web site, plus some response header
> parsing to pull out the cookies.
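For the curious, the response-header parsing and request-header building mentioned above might look something like this in Python 3. The helper names and the sample Set-Cookie values are invented for the example, and the sketch ignores path, domain, and expiry rules that a real cookie jar would enforce.

```python
from http.cookies import SimpleCookie


def cookies_from_headers(set_cookie_values):
    """Collect name -> value pairs from a list of Set-Cookie header values."""
    found = {}
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            found[name] = morsel.value
    return found


def cookie_header(cookies):
    """Build the value of the Cookie request header from name -> value pairs."""
    return "; ".join("%s=%s" % (name, value) for name, value in cookies.items())


# Made-up values; with http.client they would come from something like
# response.headers.get_all("Set-Cookie").
sample = ["SESSIONID=abc123; Path=/", "lang=en; Path=/"]
cookies = cookies_from_headers(sample)
header = cookie_header(cookies)
# The next request would then include: Cookie: <header>
```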
>
> Are you connecting with Secure HTTP (https://)?
> If so then you will have to make sure that your version
> of Python supports SSL (only the newer versions do) or
> you will have to try M2Crypto.
> See if your version of Python supports HTTPS. Start up
> your Python interpreter and type this:
> >>> import socket
> >>> hasattr (socket, "ssl")
> 1
> If you get a 1 then you are good to go for HTTPS.
> If you get a 0 then you are screwed.
> Most UNIX platforms will have SSL. On Windows the ActiveState version
> does not have SSL, but the version that comes with Cygwin does have it.
> Then see if you can connect to an https site:
> import httplib
> HOSTNAME = 'login.yahoo.com'
> conn = httplib.HTTPSConnection(HOSTNAME)
> conn.putrequest('GET', '/')
> conn.endheaders()
> response = conn.getresponse()
> print response.read()
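On Python 3 the socket.ssl attribute no longer exists, so an equivalent check would probably look like the sketch below: see whether the ssl module imports and whether http.client exposes HTTPSConnection. The function names are invented for the example, and fetch_front_page is defined but not called, since it needs a live network connection.

```python
import http.client


def https_available():
    """Return True if this interpreter appears to have SSL support."""
    try:
        import ssl  # noqa: F401  (only testing that the import works)
    except ImportError:
        return False
    # http.client only defines HTTPSConnection when ssl is importable.
    return hasattr(http.client, "HTTPSConnection")


def fetch_front_page(hostname):
    """GET / over HTTPS and return (status, body). Requires network access."""
    conn = http.client.HTTPSConnection(hostname, timeout=10)
    try:
        conn.request("GET", "/")
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


supported = https_available()
```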
>
> Can you determine what type of authentication the target web site is doing?
> Is it doing "basic authentication" (basic auth)?
> Or it might be just accepting user name and password through a standard
> HTML form.
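One way to tell the two apart: a site using basic auth answers an unauthenticated request with status 401 and a "WWW-Authenticate: Basic ..." header, whereas a form login just serves an ordinary HTML page. If it is basic auth, the credentials travel in an Authorization header, which can be built as in this small Python 3 sketch. The function name is made up, and the user name and password are the well-known example pair from the HTTP authentication RFC.

```python
import base64


def basic_auth_header(username, password):
    """Return the value for an 'Authorization' header using basic auth."""
    credentials = "%s:%s" % (username, password)
    token = base64.b64encode(credentials.encode("utf-8"))
    return "Basic " + token.decode("ascii")


value = basic_auth_header("Aladdin", "open sesame")
# With http.client this would be sent as:
#   conn.putheader("Authorization", value)
```

Note that base64 is only an encoding, not encryption, so basic auth is readable by anyone on the wire unless the connection is HTTPS.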
>
> Does the target site use some sort of session management
> to keep track of your credentials after you login?
> For example, cookies or mangled URLs.
>
> It's unfortunate that it is this complicated, but that is the
> nature of authentication. I have considered writing a framework
> to make this easier (logging in and managing cookies), but even
> after all that you still have to hack the target web site's HTML.
> With all the complicated HTML, JavaScript, frames, and redirects
> that modern web sites use, it can be very challenging to dissect
> the minimum that a site needs to authenticate you.
>
> Yours,
> Noah