Web page data and urllib2.urlopen

Piet van Oostrum piet at cs.uu.nl
Thu Aug 6 15:17:16 EDT 2009


>>>>> Dave Angel <davea at ieee.org> (DA) wrote:

>DA> Massi wrote:
>>> Hi everyone, I'm using the urllib2 library to get the html source code
>>> of web pages. In general it works great, but I'm dealing with a
>>> financial web site which does not provide the source code I expect. As
>>> a matter of fact if you try:
>>> 
>>> import urllib2
>>> res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
>>> page = res.read()
>>> print page
>>> 
>>> you will see that the printed code is very different from the one
>>> given, for example, by mozilla. Since I have very little knowledge
>>> of HTML, I can't even tell whether this is a Python or an HTML problem.
>>> Can anyone give me some help?
>>> Thanks in advance.
>>> 
>>> 
>DA> I don't think this is a Python issue, but a "raw read" versus an
>DA> interactive interpretation of a page.  The browser does lots more than a
>DA> single roundtrip defined by urlopen/read.

>DA> I also would love to see some explanation of what happens here, or a
>DA> pointer to a reference that would help me understand it.

>DA> I took the output of the read(), and formatted it, roughly, as html.  I
>DA> expected to find a refresh, which is the simplest way that one page can
>DA> cause a very different one to be loaded.
>DA>      <meta http-equiv="refresh" content="1;url=someotherurl" />

>DA> If Mozilla had seen a page with this line in an appropriate place, it'd
>DA> immediately begin loading the other page, at "someotherurl".  But there's no
>DA> such line.
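
For anyone who wants to repeat that check without reading the markup by eye, here is a minimal sketch. It assumes the page text is already in a string (for example the result of res.read()); the names RefreshFinder and find_meta_refresh are made up for illustration. It only looks at meta tags, so a redirect done by HTTP headers or by script would not show up here.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread


class RefreshFinder(HTMLParser):
    """Collect the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "refresh":
            self.targets.append(attrs.get("content"))


def find_meta_refresh(html):
    """Return the content values of all meta refresh tags found in html."""
    finder = RefreshFinder()
    finder.feed(html)
    return finder.targets
```

Calling find_meta_refresh(page) on the raw page would return an empty list if, as reported above, no refresh line is present. Under Python 3 the bytes from read() would need decoding to text first.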

>DA> Next, I looked for javascript.  The Mozilla page contains lots of
>DA> javascript, but there's none in the raw page.  So I can't explain Mozilla's
>DA> differences that way.

>DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
>DA> a CSS file could cause the content to change, just the display.

>DA> All I can guess is that it has something to do with "browser type" or
>DA> cookies.  And that would make lots of sense if this was a cgi page.  But
>DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>DA> another dozen special suffixes.
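
One way to test the "browser type" or cookies guess is to send a browser-like User-Agent header and keep cookies between requests. The following is only a sketch under that assumption: nothing in this thread confirms that the site looks at either, the User-Agent string is a placeholder, and the helper names are made up for illustration.

```python
try:
    import urllib2 as urlreq             # Python 2, as used in this thread
    import cookielib as cookiejar
except ImportError:
    import urllib.request as urlreq      # Python 3 equivalent
    import http.cookiejar as cookiejar

# Placeholder value (an assumption); substitute the string a real browser sends.
BROWSER_UA = "Mozilla/5.0 (compatible; example-browser)"


def build_browser_like_request(url, user_agent=BROWSER_UA):
    """Return a Request that announces itself with the given User-Agent."""
    return urlreq.Request(url, headers={"User-Agent": user_agent})


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = cookiejar.CookieJar()
    opener = urlreq.build_opener(urlreq.HTTPCookieProcessor(jar))
    return opener, jar
```

Usage would be along the lines of opener, jar = build_cookie_opener() followed by page = opener.open(build_browser_like_request(url)).read(), and then comparing that page with the one from a plain urlopen(url).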

>DA> Any hints, anybody???

If you look into the HTML that Firefox gets, there is a lot of
javascript in it.
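
To compare the two versions of the page in numbers rather than by eye, a rough sketch like the one below counts the script elements in an HTML string. It assumes you have both the urllib2 output and the source saved from the browser as strings; ScriptCounter and summarize_scripts are illustrative names, not part of any library.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread


class ScriptCounter(HTMLParser):
    """Count inline <script> elements and record the src of external ones."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.inline = 0
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external.append(src)
        else:
            self.inline += 1


def summarize_scripts(html):
    """Return (number of inline scripts, list of external script URLs)."""
    counter = ScriptCounter()
    counter.feed(html)
    return counter.inline, counter.external
```

Running summarize_scripts on both strings should make the difference between the raw page and the browser's page easy to see.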
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


