Retrieving Info From Web W/ Python

Anand Pillai pythonguy at Hotpop.com
Sat May 10 10:50:47 EDT 2003


And if you are behind a proxy or firewall, then you cannot even
use httplib directly, but need to pull in the proxy support of the
urllib/urllib2 modules.
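
A minimal sketch of what that might look like with urllib2's ProxyHandler. The proxy address below is a made-up placeholder, not a real server, and the try/except import is only there so the same lines also run on Python 3, where urllib2 is named urllib.request. Nothing is fetched until you call opener.open().

```python
try:
    import urllib2  # Python 2 name, as used in this thread
except ImportError:
    import urllib.request as urllib2  # Python 3 name for the same module

# Hypothetical proxy address; replace it with your site's real proxy.
PROXY_URL = "http://proxy.example.com:8080"

def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests via proxy_url."""
    handler = urllib2.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib2.build_opener(handler)

opener = build_proxy_opener(PROXY_URL)
# opener.open("http://www.example.com/") would now go through the proxy.
```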

There is a module available that handles cookie management
transparently on top of the urllib2 module. The module is called
"ClientCookie". You can browse the code at this site:

http://wwwsearch.sourceforge.net/ClientCookie/
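
As a rough illustration of the idea, here is a sketch that uses the standard library's cookielib module (http.cookiejar in Python 3), which grew out of ClientCookie, rather than ClientCookie itself. It assumes a Python version that ships cookielib. The opener stores any cookies a server sets in the jar and sends them back on later requests made through the same opener.

```python
try:
    import urllib2
    import cookielib  # Python 2.4 and later
except ImportError:
    import urllib.request as urllib2  # Python 3 names
    import http.cookiejar as cookielib

# The jar holds cookies between requests; it starts out empty.
jar = cookielib.CookieJar()

# Every request made through this opener reads from and writes to the jar.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# After something like opener.open(login_url, form_data), iterating over
# the jar would list whatever session cookies the site handed out.
```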

Anand Pillai

noah at noah.org (Noah) wrote in message news:<c9d82136.0305091223.444dce9 at posting.google.com>...
> "Laurence Spector" <laurence at trdlnk.com> wrote in message news:<mailman.1052486416.17716.python-list at python.org>...
> > I am new to Python, and I noticed there is a geturl() function that gets
> > the contents of a web address. I am trying to get Python to do this,
> > except first it has to input a username, password, and then press
> > log-in. Then it needs to click another link. And finally, print the web
> > page. How do I get Python to "click" links and input information on the
> > web in order to get to dynamically generated web pages?
> >  
> > I'd appreciate if anyone has any ideas on how to make such "web macros"
> > with Python. I assume it uses the CGI module, but the instructions only
> > seem to indicate how to take data from web pages and create new web
> > pages that incorporate it. Thanks,
> >  
> > Laurence
> 
> This can be complicated. Sorry this will probably not be encouraging
> because it sounds like you are also new to HTTP.
> 
> You have to look at the HTML source of the site you are trying to
> interface with. The HTML Form will tell you where a form is
> submitted in the ACTION attribute of the form tag.
> Clicking submit is like going to the URL of the form action.
> Then you have to encode your form input variables in order
> to submit them. How you do this depends on whether the
> form action calls a GET or POST CGI (or both). For this
> you need something lower level than urlget. You need to
> be able to edit your header before you call the remote CGI.
> This generic example shows how you might POST a form to a CGI.
> As you can see you are looking at some learning
>     import httplib
>     import urllib
>     # Example values for the form fields.
>     ex_form_var_a = "Noah"
>     ex_form_var_b = "1"
>     # URL-encode the form fields into the POST body.
>     params = urllib.urlencode({'foo': ex_form_var_a, 'bar': ex_form_var_b})
>     # Build the request by hand so the headers can be edited.
>     http = httplib.HTTPConnection('www.example.com')
>     http.putrequest('POST', '/cgi/some_cgi')
>     http.putheader("Content-type", "application/x-www-form-urlencoded")
>     http.putheader("Content-length", "%d" % len(params))
>     http.endheaders()
>     http.send(params)
>     response = http.getresponse()
> I won't even get into cookies unless you are really interested!
> But it's basically the same type of header manipulation before
> you send a request to a web site, plus some response header
> parsing to pull out the cookies.
> 
> Are you connecting with Secure HTTP (https://)?
> If so then you will have to make sure that your version
> of Python supports SSL (only the newer versions do) or
> you will have to try M2Crypto.
> See if your version of Python supports HTTPS. Start up
> your Python interpreter and type this:
>     >>> import socket
>     >>> hasattr (socket, "ssl")
>     1
> If you get a 1 then you are good to go for HTTPS.
> If you get a 0 then you are screwed.
> Most UNIX platforms will have SSL. On Windows the ActiveState version
> does not have SSL, but the version that comes with Cygwin does have it.
> Then see if you can connect to an https site:
>     import httplib
>     HOSTNAME = 'login.yahoo.com'
>     conn = httplib.HTTPSConnection(HOSTNAME)
>     conn.putrequest('GET', '/')
>     conn.endheaders()
>     response = conn.getresponse()
>     print response.read()
> 
> Can you determine what type of authentication the target web site is doing? 
> Is it doing "basic authentication" (basic auth)? 
> Or it might be just accepting user name and password through a standard 
> HTML form.
> 
> Does the target site use some sort of session management
> to keep track of your credentials after you login?
> For example, cookies or mangled URLs.
> 
> It's unfortunate that it is this complicated, but that is the
> nature of authentication. I am considering writing a framework
> to make this easier (logging in and managing cookies), but even
> after all that you still have to hack the target web site's HTML.
> With all the complicated HTML, Javascript, frames, and redirects
> that modern web sites use it can be very challenging to dissect
> the minimum that a site needs to authenticate you.
> 
> Yours,
> Noah
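
On the "basic authentication" question above: if the site does use basic auth, the credentials travel in an Authorization header, which is just "Basic " followed by the base64 encoding of "username:password". A small sketch, with a made-up user name and password, of building that header by hand so it can be passed to putheader():

```python
import base64

def basic_auth_header(username, password):
    """Return the value for an HTTP Authorization header using basic auth."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token

# Hypothetical credentials, for illustration only.
header = basic_auth_header("alice", "secret")
# With httplib this would be sent as:
#     conn.putheader("Authorization", header)
```

Base64 is an encoding, not encryption, so this only makes sense over https.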



