Newbie..Needs Help

Nick Vatamaniuc vatamane at gmail.com
Fri Jul 28 12:36:57 EDT 2006


Graham,

I won't write the program for you since I have my own program to work
on, but here is an idea of how to do it.
1) You need a function to download the page -- use the urllib
module, like this:
import urllib
page=urllib.urlopen(URL_GOES_HERE).read()

2) Go to the page with your browser and view the HTML source.
You will need to find specific HTML patterns that you can use to
identify the boundaries between the races. A good one would be
the actual title 'Race 1 results:', then 'Race 2 results:' and
so on up to 'Race 8 results:'. From this you need to derive a regular
expression in Python (the documentation is at
http://docs.python.org/lib/module-re.html) that expresses all those
boundaries as one pattern: 'Race [0-9]+ results:'. In other words:
the word 'Race', then a space, then one or more digits, then
another space, then 'results:'. So you can do:
import re

races_pattern = re.compile(r'Race [0-9]+ results:')  # <- this is your pattern
chunks = races_pattern.split(page)  # <- split the page into chunks at each title
You will have 9 chunks if there are 8 races. The first one will be all
the stuff before the first title (i.e. the start of the page), so throw
it away:

chunks = chunks[1:]

3) Now go back to the HTML source and look inside each race at the
table with the results, and find a pattern that makes a good boundary
between table rows. Again, use regular expressions as before to split
each table away from the other junk, then split each table into rows
(use <tr>). For example:
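A minimal sketch of that split (the chunk value here is made up so the
snippet runs on its own; the exact row boundary may need adjusting to
the real page):

import re

# a made-up race chunk standing in for one piece of the real page
chunk = '<table><tr><td>2</td><td>$4.60</td></tr><tr><td>5</td></tr></table>'
rows = re.split(r'<tr.*?>', chunk)  # each <tr> starts a new table row
rows = rows[1:]  # everything before the first <tr> is junk, discard it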

4) Look again at the source, and split each row into data cells (use
<td>). For example:
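Continuing the sketch with one row string carried over from step 3
(again, the value is made up):

import re

row = '<td>2</td><td>$4.60</td><td>$2.40</td></tr>'
cells = re.split(r'<td.*?>', row)  # each <td> starts a new data cell
cells = cells[1:]  # drop the junk before the first <td>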

5) Then, for each of the split cell chunks, strip out the remaining
HTML tags with:

chunk = re.sub('<.*?>', '', chunk)

6) Now all you should have left is pure data stored in strings: one
for each data cell, in each table row, in each table.

7) Then save the data to a text file and import it into your database.
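For the text-file half, here is a minimal sketch using the standard
csv module (the filename and the sample row are made up; the field
order follows the list in your message below):

import csv

# one made-up result row: date, track, race no, placings, dividends
row = ['27/07/2006', 'Bendigo', 1, 2, 5, 1,
       '$4.60', '$2.40', '$2.70', '$1.30', '$23.00', '$120.70']
f = open('results.csv', 'wb')  # 'wb' keeps the csv module happy (Python 2)
writer = csv.writer(f)
writer.writerow(row)
f.close()

Access can import a CSV file like this directly, which may be easier
than writing into the database from Python.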

Anyway, that's the general idea. There are other ways to do it, but
that's my approach. I have written a couple of screen-scraping
applications like this in Python before using this method, and it
worked well enough.

Good luck,
Nick V.




Graham Feeley wrote:
> Thanks, Nick, for the reply.
> Of course my first post was a general posting to see if someone would be
> able to help.
> Here is the website that holds the data I require:
> http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=27/07/2006&meetings=bdgo
>
> The fields required are as follows:
>
>  NSW Tab
>  #      Win       Place
>  2      $4.60     $2.40
>  5                $2.70
>  1                $1.30
>  Quin   $23.00
>  Tri    $120.70
> Field names are
> Date   ( not important )
> Track................= Bendigo
> RaceNo............on web page
> Res1st...............2
> Res2nd..............5
> Res3rd..............1
> Div1..................$4.60
> DivPlc...............$2.40
> Div2..................$2.70
> Div3..................$1.30
> DivQuin.............$23.00
> DivTrif...............$120.70
> As you can see, there are a total of 6 meetings involved, and I would
> need to put in this parameter (=bdgo or =gosf); these are the meeting
> tracks.
>
> Hope this is more enlightening.
> Regards
> graham
>
> "Graham Feeley" <grahamjfeeley at optusnet.com.au> wrote in message
> news:44ca0e2b$0$1207$afc38c87 at news.optusnet.com.au...
> > Hi this is a plea for some help.
> > I am enjoying a script that was written for me; its purpose is to
> > collect data from a web site and put it into an Access database table.
> > It works fine; however, it is a sports info table, and now I need to
> > collect the results of those races.
> >
> > I simply can't keep up putting the results in manually.
> > I don't care if it is an Access table or a text file (whichever is
> > easiest); there are only 12 fields to extract.
> > The person who wrote the script is not available, as he is engrossed
> > in another project which is taking all his time.
> > I hope someone with a little time on his hands is willing to help me.
> > Regards
> > Graham



