Can not get urllib.urlopen to work

Erik Johnson spam at nospam.org
Wed Oct 27 19:22:33 EDT 2004


Sean Berry wrote:

> A few years back I needed to get information about the weather based on zip
> code.  I used to use weather.com but they fixed the hole.  They made any
> port 80 request forward to another address, which in turn forwarded you to
> the original request.  So if you went to www.weather.com, it would forward
> you to www2.weather.com, then back to www.weather.com.  The index page at
> www.weather.com would then check the referrer to see if it came from
> www2.weather.com.  If the referrer was correct, then you got the page
> content.  If it was wrong, there was no page.
>
> Similarly, www2.weather.com would check to see that the referrer was
> www.weather.com... so there was no way around it.

Au contraire, monsieur...  there is no reason I can't set the Referer
header myself (using httplib).
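
A minimal sketch of that idea (the hostnames here are just placeholders,
and note that HTTP spells the header "Referer"):

import httplib

# Pretend we arrived via www2.weather.com by setting the Referer
# header ourselves. Host and path are illustrative only.
conn = httplib.HTTPConnection("www.weather.com")
conn.request("GET", "/", headers={"Referer": "http://www2.weather.com/"})
resp = conn.getresponse()
print resp.status, resp.reason
page = resp.read()
conn.close()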

> I use another method to protect my programs... but it still does the same
> thing ultimately... stops people from using my programs.

    It is a level of discouragement, and may be sufficient to stop the
simplest of abuses, but the reality is that, given enough work, a Python
program can recreate any HTTP transaction or series of transactions needed.
If you want to redirect me to a dynamically generated URL, fine... I'll read
the Location header and go there. If you want to set a cookie there, fine,
I'll read the Set-Cookie header, set it, and return it to you, just as a
browser does. If you want to check the User-Agent, I can send headers that
look exactly like what MSIE or any other browser would send. You can make it
difficult, no doubt, and doing so may be a wise thing to do, but it is wrong
to say that just because A sends you to B, and B refers you back to A, there
is no way around it.
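
To make that concrete, here is a rough sketch (hostnames invented) of
chasing one redirect by hand, echoing the cookie back, and sending a
browser-like User-Agent:

import httplib

# First hop: see where the server wants to send us, and grab the cookie.
conn = httplib.HTTPConnection("www.example.com")
conn.request("GET", "/")
resp = conn.getresponse()
location = resp.getheader("Location")    # e.g. "http://www2.example.com/"
cookie = resp.getheader("Set-Cookie")    # e.g. "sess=abc123; path=/"
resp.read()
conn.close()

# Second hop: present the cookie and a browser-like User-Agent.
headers = {"User-Agent":
           "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"}
if cookie:
    headers["Cookie"] = cookie.split(";")[0]   # just the name=value part
conn = httplib.HTTPConnection("www2.example.com")
conn.request("GET", "/", headers=headers)
print conn.getresponse().status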

    I am currently doing exactly this sort of thing, but not to abuse
others' work. We pay for a service where data is published on a web site. We
need to get at this data several times an hour, 24x7. There is a login page,
one or more HTML forms that must be filled out and submitted, and cookies
set and checked before you can reach the pages that contain the data we
need. The site was not designed to be machine-read, but that doesn't mean
reading it in an automated manner is "wrong" or abusive - we're paying for
that data; they just happen to present it in a format that isn't all that
conducive to automated parsing. Fine, I'll do the extra work to get at it
that way.
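
The shape of that dance, roughly (the field names, paths, and host are all
made up for illustration):

import httplib, urllib

# Log in by POSTing the form, then carry the session cookie forward.
params = urllib.urlencode({"username": "me", "password": "secret"})
conn = httplib.HTTPConnection("data.example.com")
conn.request("POST", "/login", params,
             {"Content-Type": "application/x-www-form-urlencoded"})
resp = conn.getresponse()
session = resp.getheader("Set-Cookie", "").split(";")[0]
resp.read()
conn.close()

# Fetch the page we actually want, presenting the session cookie.
conn = httplib.HTTPConnection("data.example.com")
conn.request("GET", "/reports/today", headers={"Cookie": session})
data = conn.getresponse().read()
conn.close()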

    I suppose you could meter, on the server side, the number of requests a
particular IP address is making, and there isn't much that could be done
about that on the requester's end. With access to several machines, though,
even this hurdle could be cleared.

    So, anyway, if the OP wants to get more sophisticated about automated
surfing, he should check out the httplib module:
http://docs.python.org/lib/module-httplib.html

    FYI... www.dictionary.com loads (for me, now) through a simple telnet
request:

ej at sand:~/src/python> telnet www.dictionary.com 80
Trying 66.161.12.81...
Connected to www.dictionary.com.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Wed, 27 Oct 2004 22:57:49 GMT
Server: Apache
Cache-Control: must-revalidate
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Wed, 27 Oct 2004 22:58:07 GMT
Pragma: no-cache
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">
<html>

<snip!>
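
For the record, the same request from Python looks something like:

import httplib

# The equivalent of the telnet session above.
conn = httplib.HTTPConnection("www.dictionary.com")
conn.request("GET", "/")
resp = conn.getresponse()
print resp.status, resp.reason    # 200 OK (as of this writing)
print resp.read()[:200]           # first chunk of the HTML
conn.close()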

HTH,
-ej




