[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

Kent Johnson kent37 at tds.net
Tue Jul 7 19:56:45 CEST 2009


On Tue, Jul 7, 2009 at 1:20 PM, David Kim<davidkim05 at gmail.com> wrote:
> On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson<kent37 at tds.net> wrote:
>>
>> curl works because it ignores the redirect to the ToS page, and the
>> site is (astoundingly) dumb enough to serve the content with the
>> redirect. You could make urllib2 behave the same way by defining a 302
>> handler that does nothing.
>
> Many thanks for the redirect pointer! I also found
> http://diveintopython.org/http_web_services/redirects.html. Is the
> handler class on this page what you mean by a handler that does
> nothing? (It looks like it exposes the error code but still follows
> the redirect).

No, all of those examples are handling the redirect. The
SmartRedirectHandler just captures additional status. I think you need
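something like this:

import urllib2

class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
    # Returning None declines to follow the redirect. urllib2 then falls
    # back to its default error handler, which raises HTTPError; the
    # exception object can still be read like a response.
    def http_error_301(self, req, fp, code, msg, headers):
        return None

    def http_error_302(self, req, fp, code, msg, headers):
        return None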
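
Here is a rough sketch of how a handler like that might be installed and used. It is written with the Python 3 module names (urllib.request and urllib.error, which replaced urllib2), and the helper names make_opener and fetch_ignoring_redirects are made up for illustration. It assumes the server really does send a body along with the 301/302, as this site appears to.

```python
import urllib.error
import urllib.request


class IgnoreRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Decline to follow 301/302 redirects (same idea as the class above)."""

    def http_error_301(self, req, fp, code, msg, headers):
        return None

    def http_error_302(self, req, fp, code, msg, headers):
        return None


def make_opener():
    # Hypothetical helper. Passing a subclass of HTTPRedirectHandler makes
    # build_opener leave out the default redirect handler, so nothing else
    # in the opener will follow the redirect.
    return urllib.request.build_opener(IgnoreRedirectHandler)


def fetch_ignoring_redirects(url, opener=None):
    """Return (status, body) for url without following 301/302 redirects."""
    opener = opener or make_opener()
    try:
        response = opener.open(url)
    except urllib.error.HTTPError as err:
        # No handler claimed the 301/302, so urllib raised HTTPError.
        # The exception doubles as a response object, so whatever body
        # the server sent with the redirect can still be read from it.
        if err.code in (301, 302):
            return err.code, err.read()
        raise
    with response:
        return response.status, response.read()
```

Calling fetch_ignoring_redirects(url) with the data page's URL should then hand back the redirect status together with whatever the server sent in that same response, without ever requesting the ToS page.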

> I guess I'm still a little confused since, if the
> handler does nothing, won't I still go to the ToS page?

No, it is the action of the handler, responding to the redirect
request, that causes the ToS page to be fetched.

> For example, I ran the following code (found at
> http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)

That is pretty similar to the Dive Into Python code...

> I suspect I am not understanding something basic about how urllib2
> deals with this redirect issue since it seems everything I try gives
> me the same ToS page.

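Maybe you don't understand how redirects work in general. The server
answers with a 3xx status and a Location header naming another URL; it
is the client that then makes a second request for that URL.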

>> Generally you have to post to the same url as the form, giving the
>> same data the form does. You can inspect the source of the form to
>> figure this out. In this case the form is
>>
>> <form method="post" action="/products/consent.php">
>> <input type="hidden" value="tiwd/products/derivserv/data_table_i.php"
>> name="urltarget"/>
>> <input type="hidden" value="1" name="check_one"/>
>> <input type="hidden" value="tiwdata" name="tag"/>
>> <input type="submit" value="I Agree" name="acknowledgement"/>
>> <input type="submit" value="Decline" name="acknowledgement"/>
>> </form>
>>
>> You generally need to enable cookie support in urllib2 as well,
>> because the site will use a cookie to flag that you saw the consent
>> form. This tutorial shows how to enable cookies and submit form data:
>> http://personalpages.tds.net/~kent37/kk/00010.html
>
> I have seen the login examples where one provides values for the
> fields username and password (thanks Kent). Given the form above,
> however, it's unclear to me how one POSTs the form data when you
> aren't actually passing any parameters. Perhaps this is less of a
> Python question and more an http question (which unfortunately I know
> nothing about either).

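You are passing parameters; they are listed in the form. Each <input>
contributes one name=value pair, including the hidden fields and the
submit button you "click" (acknowledgement=I Agree).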
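
As a rough sketch of how the form fields quoted above could become POST data, again using the Python 3 module names rather than urllib2: the field names and values are copied from the form, but the helper names and the base URL argument are placeholders of mine, since the site's host name is not shown here. The code only builds the request; it does not send anything.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Names and values copied from the consent form quoted above. Only one
# submit button is included: the one a user would click to agree.
FORM_FIELDS = {
    "urltarget": "tiwd/products/derivserv/data_table_i.php",
    "check_one": "1",
    "tag": "tiwdata",
    "acknowledgement": "I Agree",
}


def build_consent_request(base_url):
    """Build (but do not send) the POST a browser would make for the form.

    base_url is a placeholder for the site's real scheme and host.
    """
    # The form's action attribute is /products/consent.php.
    url = urllib.parse.urljoin(base_url, "/products/consent.php")
    body = urllib.parse.urlencode(FORM_FIELDS).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body)


def make_cookie_opener():
    """Return (opener, jar); the jar keeps any cookie the site sets."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar
```

To use it you would call opener.open(build_consent_request(...)) with the real base URL, then request the data page through the same opener so that any consent cookie is sent back.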

If you don't have at least a basic understanding of HTTP and HTML you
are going to have trouble with this project...

Kent

