Can't get the real contents from page in internet as the tag "no-cache"

Tim Roberts timr at probo.com
Thu Mar 23 03:39:10 EST 2006


"dongdong" <dongdonglove8 at hotmail.com> wrote:
>
>Using a web browser I can get the page's content normally, but when I use
>urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()
>
>the result is
>
><html><head><META HTTP-EQUIV=REFRESH
>CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
><META http-equiv="Pragma"
>content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>
>
>I think the reason is the no-cache tag. Can anyone help me?

No, that's not the reason.  The reason is that the page you got back is a
redirect.

As an HTML consumer, you are supposed to parse that content and notice the
<meta http-equiv> tag, which says "here is something that should have been
one of the HTTP headers".

In this case, it wants you to act as though you saw:
    Refresh: 0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html
    Pragma: no-cache

The "Refresh" header with a delay of 0 means that you are supposed to go fetch
the contents of that new page immediately.  Try using urllib2.urlopen on THAT
address, and you should get your content.
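
If you would rather not copy the new address by hand, the lookup can be
automated.  What follows is only a rough sketch, not a tested recipe: it is
written for Python 3, where urllib.request takes the place of urllib2, and
the names MetaRefreshParser, find_meta_refresh, fetch_following_refresh and
the max_hops limit are all made up for this example.  It assumes the refresh
content has the usual "delay;URL=address" shape.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "0;URL=http://host/new.html"
        _delay, _sep, rest = (attrs.get("content") or "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.target = rest[4:].strip().strip("'\"")


def find_meta_refresh(html):
    """Return the refresh target found in html, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.target


def fetch_following_refresh(url, max_hops=5):
    """Fetch url, following meta refresh redirects up to max_hops times."""
    for _ in range(max_hops):
        body = urlopen(url).read()
        # latin-1 maps every byte, so this decode cannot fail whatever
        # charset the page really uses; we only need the ASCII markup.
        target = find_meta_refresh(body.decode("latin-1"))
        if target is None:
            return body
        url = urljoin(url, target)  # the target may be a relative URL
    raise RuntimeError("too many meta refresh hops")
```

Calling fetch_following_refresh() on the old address should then hand back
the bytes of the page the redirect points to.  The hop limit is there so
that two pages refreshing to each other cannot loop forever.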

This is one way to handle a web site reorganization and still allow older
URLs to work.
-- 
- Tim Roberts, timr at probo.com
  Providenza & Boekelheide, Inc.


