using urllib on a more complex site

Sun Feb 24 20:28:00 EST 2013

On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> 
> > I'm trying to write a simple script to scrape http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> 
> >
> 
> > in order to send myself an email every day of the 99c movie of the day.
> 
> >
> 
> > However, using a simple command like (in Python 3.0):
> 
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> 
> >
> 
> > I don't get the all the source I need, its just the navigation buttons.  Now I assume they are using some CSS/javascript witchcraft to load all the useful data later, so my question is how do I make urllib "wait" and grab that data as well?
> 
> >
> 
> 
> 
> The CSS and the jpegs, and many other aspects of a web "page" are loaded 
> 
> explicitly, by the browser, when parsing the tags of the page you 
> 
> downloaded.  There is no sooner or later.  The website won't send the 
> 
> other files until you request them.
> 
> 
> 
> For example, that site at the moment has one image (prob. jpeg) 
> 
> highlighted,
> 
> 
> 
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" 
> 
> alt="Sex and the City: The Movie (Theatrical)">
> 
> 
> 
> if you want to look at that jpeg, you need to download the file url 
> 
> specified by the src attribute of that img element.
> 
> 
> 
> Or perhaps you can just look at the 'alt' attribute, which is mainly 
> 
> there for browsers who don't happen to do graphics, for example, the 
> 
> ones for the blind.
> 
> 
> 
> Naturally, there may be dozens of images on the page, and there's no 
> 
> guarantee that the website author is trying to make it easy for you. 
> 
> Why not check if there's a defined api for extracting the information 
> 
> you want?  Check the site, or send a message to the webmaster.
> 
> 
> 
> No guarantee that tomorrow, the information won't be buried in some 
> 
> javascript fragment.  Again, if you want to see that, you might need to 
> 
> write a javascript interpreter.  it could use any algorithm at all to 
> 
> build webpage information, and the encoding could change day by day, or 
> 
> hour by hour.
> 
> 
> 
> -- 
> 
> DaveA

The problem is, the image url you found is not returned in the data urllib grabs.  To be clear, I was aware of what urllib is supposed to do (ie not download image data when loading a page), I've used it before many times, just never had to jump through hoops to get at the content I needed.

I'll look into figuring out how to find XHR requests in Chrome, I didn't know what they called that after the fact loading, so now my searching will be more productive.