[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

David Kim davidkim05 at gmail.com
Tue Jul 7 19:20:25 CEST 2009


On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson<kent37 at tds.net> wrote:
>
> curl works because it ignores the redirect to the ToS page, and the
> site is (astoundingly) dumb enough to serve the content with the
> redirect. You could make urllib2 behave the same way by defining a 302
> handler that does nothing.

Many thanks for the redirect pointer! I also found
http://diveintopython.org/http_web_services/redirects.html. Is the
handler class on that page what you mean by a handler that does
nothing? (It looks like it exposes the status code but still follows
the redirect.) I guess I'm still a little confused: if the handler
does nothing, won't I still end up at the ToS page?

For example, I ran the following code (found at
http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)
and ended up pulling the same ToS page anyway.

####
import urllib2

redirect_handler = urllib2.HTTPRedirectHandler()

class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        return urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302

cookieprocessor = urllib2.HTTPCookieProcessor()

opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
urllib2.install_opener(opener)

response = urllib2.urlopen("http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1")
print response.read()
####

I suspect I am not understanding something basic about how urllib2
deals with this redirect issue since it seems everything I try gives
me the same ToS page.
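
Looking at it again, the handler above hands the 302 straight back to
the parent class, which is exactly what follows the redirect. My best
guess at "a handler that does nothing" is the sketch below, which
returns the original response instead. The names NoRedirectHandler and
build_no_redirect_opener are my own, the try/except import is only
there so the same code loads on Python 3, and I have not tried it
against the DTCC site, so treat it as an assumption rather than a fix.

```python
try:                                  # Python 2, as used in this thread
    import urllib2 as request
except ImportError:                   # Python 3 moved urllib2 here
    import urllib.request as request


class NoRedirectHandler(request.HTTPRedirectHandler):
    """Hypothetical handler: return the 3xx response, do not follow it."""

    def http_error_302(self, req, fp, code, msg, headers):
        # fp is the response to the original request. Returning it,
        # instead of calling the parent method, skips the Location hop.
        return fp

    # Treat the other redirect codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_302


def build_no_redirect_opener():
    # Passing a subclass of HTTPRedirectHandler replaces the default one.
    return request.build_opener(NoRedirectHandler)
```

If the site really does send the table along with the redirect, then
build_no_redirect_opener().open(url).read() should return that body
rather than the ToS page.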

> Generally you have to post to the same url as the form, giving the
> same data the form does. You can inspect the source of the form to
> figure this out. In this case the form is
>
> <form method="post" action="/products/consent.php">
> <input type="hidden" value="tiwd/products/derivserv/data_table_i.php"
> name="urltarget"/>
> <input type="hidden" value="1" name="check_one"/>
> <input type="hidden" value="tiwdata" name="tag"/>
> <input type="submit" value="I Agree" name="acknowledgement"/>
> <input type="submit" value="Decline" name="acknowledgement"/>
> </form>
>
> You generally need to enable cookie support in urllib2 as well,
> because the site will use a cookie to flag that you saw the consent
> form. This tutorial shows how to enable cookies and submit form data:
> http://personalpages.tds.net/~kent37/kk/00010.html

I have seen the login examples where one provides values for the
username and password fields (thanks Kent). Given the form above,
however, it's unclear to me how to POST the form data when I am not
actually supplying any parameters myself. Perhaps this is less of a
Python question and more of an HTTP question (which, unfortunately, I
know nothing about either).
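
If I am reading the form correctly, the hidden inputs are themselves
the parameters: a browser sends each name/value pair, plus the
name/value of whichever submit button was clicked. A rough sketch of
that, following the form Kent quoted, is below. CONSENT_URL assumes the
form's action is resolved against www.dtcc.com, the function names are
made up, and whether the site then sets the cookie it needs is
something I have not verified.

```python
try:                                  # Python 2, as used in this thread
    import urllib2 as request
    from urllib import urlencode
    from cookielib import CookieJar
except ImportError:                   # Python 3 equivalents
    import urllib.request as request
    from urllib.parse import urlencode
    from http.cookiejar import CookieJar

# Assumption: the form's action="/products/consent.php" on www.dtcc.com.
CONSENT_URL = "http://www.dtcc.com/products/consent.php"


def build_consent_request():
    """Build the POST a browser would send on clicking "I Agree"."""
    fields = [
        ("urltarget", "tiwd/products/derivserv/data_table_i.php"),
        ("check_one", "1"),
        ("tag", "tiwdata"),
        ("acknowledgement", "I Agree"),   # the submit button that was clicked
    ]
    data = urlencode(fields).encode("ascii")
    # Supplying data makes the request a POST instead of a GET.
    return request.Request(CONSENT_URL, data)


def fetch_table(data_url):
    """Accept the ToS, then request the data page (makes network calls)."""
    # One opener for both requests, so any consent cookie is reused.
    opener = request.build_opener(request.HTTPCookieProcessor(CookieJar()))
    opener.open(build_consent_request()).close()
    return opener.open(data_url).read()
```

Calling fetch_table() with the data_table_i.php URL from my code above
would be the intended use.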

Thanks so much again for the help!

DK



--
morenotestoself.wordpress.com

