difference between urllib2.urlopen and firefox view 'page source'?

John Nagle nagle at animats.com
Tue Mar 20 13:32:33 EDT 2007


    Here's a useful online tool that might help you see what's happening:

	http://www.sitetruth.com/experimental/viewer.html

We use this to help webmasters see what our web crawler is seeing.

    This reads a page, using Python and urllib.FancyURLopener, with a
User-Agent string of "SiteTruth.com site rating system."
Then it parses the page with BeautifulSoup, removes all
<SCRIPT>, <EMBED>, and <OBJECT> tags, makes all the links
absolute, then writes the page back out encoded as UTF-8.
The resulting cleaned-up page is displayed.
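
    The cleanup steps above can be sketched roughly as follows. This is
not the viewer's actual code: it is a minimal illustration that uses only
the standard library's html.parser as a stand-in for BeautifulSoup, written
for Python 3. The names PageCleaner and clean_page, and the choice of
href and src as the link attributes, are assumptions made for the example.
Comments and the doctype are simply dropped in this sketch.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

SKIP_WITH_CONTENT = {"script", "object"}  # dropped together with their contents
DROP_TAG_ONLY = {"embed"}                 # void element: nothing inside to skip
URL_ATTRS = {"href", "src"}               # assumption: attributes treated as links


class PageCleaner(HTMLParser):
    """Hypothetical helper: re-emits HTML minus the unwanted tags."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.base_url = base_url
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <object>

    def _render(self, tag, attrs, close=">"):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "checked"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)  # make the link absolute
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + close

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag not in DROP_TAG_ONLY and not self.skip_depth:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        dropped = tag in SKIP_WITH_CONTENT or tag in DROP_TAG_ONLY
        if not dropped and not self.skip_depth:
            self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag in SKIP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in DROP_TAG_ONLY and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)


def clean_page(html_text, base_url):
    """Return the cleaned page as UTF-8 encoded bytes."""
    cleaner = PageCleaner(base_url)
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.out).encode("utf-8")
```

    Calling clean_page(page_text, page_url) on an already-decoded page
returns bytes that can be written straight to a file or an HTTP response.
A real parser such as BeautifulSoup copes with badly broken markup far
better than this sketch does.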

    If the page you're trying to read looks OK with our viewer,
you should be able to read it from Python with no problems.
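
    On the User-Agent question in the quoted message below: Python's
urllib identifies itself as "Python-urllib/x.y" by default, and some
servers may return different HTML for that than for a browser. A minimal
sketch of sending an explicit User-Agent, written for Python 3 (where
urllib2.urlopen became urllib.request.urlopen), might look like this.
The helper names build_request and fetch and the example User-Agent
string are assumptions for illustration only.

```python
import urllib.request

# Illustrative browser-like string; substitute whatever your browser sends.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=10):
    """Fetch the raw bytes of a page using the given User-Agent."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read()
```

    Fetching the same URL once with the default agent and once with a
browser-like one, then diffing the results, is one way to check whether
the server varies its output by User-Agent. Content generated by
JavaScript in the browser will not appear in either fetch.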

				John Nagle

cjl wrote:
> Hi.
> 
> I am trying to screen scrape some stock data from yahoo, so I am
> trying to use urllib2 to retrieve the html and beautiful soup for the
> parsing.
> 
> Maybe (most likely) I am doing something wrong, but when I use
> urllib2.urlopen to fetch a page, and when I view 'page source' of the
> exact same URL in firefox, I am seeing slight differences in the raw
> html.
> 
> Do I need to set a browser agent so yahoo thinks urllib2 is firefox?
> Is yahoo detecting that urllib2 doesn't process javascript, and
> passing different data?
> 
> -cjl
> 


