Suitable Python code to scrape specific details from web pages.

Tue Aug 12 16:11:47 EDT 2014

On Tue, 12 Aug 2014 13:00:30 -0700 (PDT)
Simon Evans <musicalhacksaw at yahoo.co.uk> wrote:

> Dear Programmers,
> I have been looking at Chris Reeves's 'Web Scraping Tutorials' on YouTube. I have tried a few of his Python programs at the Python 2.7 command prompt, but altered them so that, instead of following links to data such as the Dow Jones index, they access the details I am interested in from the 'Racing Post' on a daily basis. However, in the example I am going to give, the code does not return the information I am seeking: instead of returning the given odds on a horse, it only returns [], which isn't much use.
> I would be glad if you could tell me where I am going wrong. 
> Yours faithfully
> Simon Evans.
> --------------------------------------------------------------------------------
> >>> import urllib
> >>> import re
> >>> htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048r_date=2014-05-08#raceTabs=sc_")
> >>> htmltext = htmlfile.read()
> >>> regex = '<strong>1<a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=758752"onclick="scorecards.send("horse_name&quot:):return Html.popup(this,{width:695,height:800})"title="Full details about this HORSE">Lively Baron</a>9/4F</strong><br/>'
> >>> pattern = re.compile(regex)
> >>> odds = re.findall(pattern, htmltext)
> >>> print odds
> []
> >>>
> --------------------------------------------------------------------------------
> >>> import urllib
> >>> import re
> >>> htmlfile = urllib.urlopen("http://www.racingpost.com/horses2/cards/card.sd?race_id=600048r_date=2014-05-08#raceTabs=sc_")
> >>> htmltext = htmlfile.read()
> >>> regex = '<a></a>'
> >>> pattern = re.compile(regex)
> >>> odds = re.findall(pattern, htmltext)
> >>> print odds
> []
> >>>
> -------------------------------------------------------------------------------
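
One likely reason the first attempt prints [] is that the pattern is a
literal chunk of HTML, and characters such as "?", "." and "(" are regex
metacharacters, so "sd?horse_id" means "s, an optional d, then horse_id"
and never matches the literal "?" in the page. The second pattern,
'<a></a>', only matches an anchor with no attributes and no text. Here is
a minimal sketch of the metacharacter problem, written for Python 3. The
sample string is a made-up, simplified stand-in for the page, not the
real Racing Post markup.

```python
import re

# Hypothetical, simplified stand-in for one line of the page.
html = '<a href="horse_home.sd?horse_id=758752">Lively Baron</a> 9/4F'

# The opening tag, copied literally from the "page".
literal = '<a href="horse_home.sd?horse_id=758752">'

# Used directly as a pattern, "d?" means "optional d", so the literal
# "?" in the text is never matched and findall returns an empty list.
naive = re.findall(literal, html)

# re.escape() backslash-escapes the metacharacters so the string is
# matched character for character.
escaped = re.findall(re.escape(literal), html)

# Capture the parts that vary (name and odds) instead of hard-coding them.
# The odds format (digits/digits, optional trailing F) is an assumption.
pattern = re.compile(re.escape(literal) + r'([^<]+)</a>\s*([0-9]+/[0-9]+F?)')
runners = pattern.findall(html)

print(naive)
print(escaped)
print(runners)
```

Even with escaping, a pattern like this breaks as soon as the site changes
an attribute or its whitespace.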

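If you want to do web scraping, use Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/ .  It parses the HTML for
you, which is far more robust than matching raw markup with regular
expressions.  End of story.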
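
To give an idea of what that looks like, here is a rough sketch using
Beautiful Soup 4 (the bs4 package) on Python 3. The HTML is a made-up
sample loosely modelled on the snippet in the question. The second
runner, its horse_id, the page layout, and the extract_runners helper are
all assumptions for illustration, so the selectors would need adjusting
against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real page layout may differ.
SAMPLE_HTML = """
<div>
  <strong>1 <a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=758752"
     title="Full details about this HORSE">Lively Baron</a> 9/4F</strong><br/>
  <strong>2 <a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=000000"
     title="Full details about this HORSE">Example Runner</a> 5/1</strong><br/>
</div>
"""


def extract_runners(html):
    """Return (horse name, odds) pairs from the assumed layout above."""
    soup = BeautifulSoup(html, "html.parser")
    runners = []
    # Find the horse links by their title attribute, not by exact markup.
    for link in soup.find_all("a", title="Full details about this HORSE"):
        name = link.get_text(strip=True)
        # In this assumed layout the odds are the text right after the link.
        sibling = link.next_sibling
        odds = sibling.strip() if isinstance(sibling, str) else ""
        runners.append((name, odds))
    return runners


for name, odds in extract_runners(SAMPLE_HTML):
    print(name, odds)
```

For the live page, the same function could be fed the text read from the
URL in place of SAMPLE_HTML.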

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.