Newbie..Needs Help

Anthra Norell anthra.norell at tiscalinet.ch
Sat Jul 29 16:43:59 EDT 2006


----- Original Message -----
From: "Graham Feeley" <grahamjfeeley at optusnet.com.au>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Friday, July 28, 2006 5:11 PM
Subject: Re: Newbie..Needs Help


> Thanks Nick for the reply
> Of course my first post was a general posting to see if someone would be
> able to help
> here is the website which holds the data I require
> http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=27/07/2006&meetings=bdgo
>
> The fields required are as follows
>  NSW Tab
> #      Win      Place
>  2    $4.60   $2.40
>  5                $2.70
>  1                $1.30
>  Quin    $23.00
>  Tri  $120.70
> Field names are
> Date   ( not important )
> Track................= Bendigo
> RaceNo............on web page
> Res1st...............2
> Res2nd..............5
> Res3rd..............1
> Div1..................$4.60
> DivPlc...............$2.40
> Div2..................$2.70
> Div3..................$1.30
> DivQuin.............$23.00
> DivTrif...............$120.70
> As you can see there are a total of 6 meetings involved and I would need to
> put in this parameter ( =bdgo) or (=gosf) these are the meeting tracks
>
> Hope this more enlightening
> Regards
> graham
>

Graham,

Only a few days ago I helped someone who had a very similar problem by handing him code ready to run. I am doing it again for
you.
      The site you use is much harder to interpret than the other one was, so I took the opportunity to experimentally stretch
the envelope of a new brainchild of mine: a stream editor called SE. Since it is new, this is also a chance to demo it.
      One correspondent in the previous exchange was Paul McGuire, the author of 'pyparsing'. He made a good case for using
'pyparsing' in situations like yours. Unlike a stream editor, a parser reads structure in addition to data and can relate the
data to its context.
      Analyzing the tables I noticed that they are poorly structured: the first column contains both data and ids. Some records
are shorter than others, so column ids have to be guessed and hard coded. Missing data is sometimes a dash, sometimes nothing.
The inconsistencies seem to be consistent, though, across all eight tables of the page, so they can be formalized with some
confidence that they are systematic. If Paul could spend some time on this, I'd be much interested to see how he would handle
the relative disorder.
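The point about hard-coded column ids can be made concrete with the standard 're' module. This is an illustrative sketch only, using a made-up row fragment; it mimics the dash-marking idea used in the program below, not the SE syntax:

```python
import re

# Made-up fragment of the kind of table row described above:
# one cell holds a price, one is a dash, one is empty.
row = '<td>$4.60</td><td>-</td><td>  </td>'

# Mark every empty cell with a dash so missing data stays visible
# after the tags are stripped.
marked = re.sub(r'(?i)>\s*</td>', '>-</td>', row)

# A stream editor only rewrites patterns; which column means what
# still has to be hard coded by position.
cells = re.findall(r'<td>(.*?)</td>', marked)
print(cells)  # ['$4.60', '-', '-']
```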
      Another thought: the time one invests in developing a program should not exceed the time it can save overall (not talking
about recreational programming). Web pages justify an extra measure of caution: they may change at any time, and every change
stops the reader from working and imposes a revision as an unscheduled priority.

So, here is your program. I write it so you can copy the whole thing to a file. Next copy SE from the Cheese Shop. Unzip it and put
both SE.PY and SEL.PY where your Python programs are. Then 'execfile' the code in an IDLE window, call 'display_horse_race_data
('Bendigo', '27/07/2006')' and see what happens. You'll have to wait ten seconds or so.

Regards

Frederic

######################################################################################

TRACKS = { 'New Zealand' : '',
           'Bendigo'     : 'bdgo',
           'Gosford'     : 'gosf',
           'Northam'     : 'nthm',
           'Port Augusta': 'pta',
           'Townsville'  : 'town',
         }


# This function does it all once all functions are loaded. If nothing shows, the
# page has no data.

def display_horse_race_data (track, date, clip_summary = 100):

   """
      track: e.g. 'Bendigo' or 'bdgo'
      date: e.g. '27/07/2006'
      clip_summary: each table has a long summary header.
        the argument says how much of it to show.
   """

   if track [0].isupper ():
      if track in TRACKS:
         track = TRACKS [track]
      else:
         print 'No such track %s' % track
         return
   open ()
   header, records = get_horse_race_data (track, date)
   show_records (header, records, clip_summary)



######################################################################################


import SE, urllib

_is_open = 0

def open ():

   global _is_open

   if not _is_open:   # Skip repeat calls

      global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker

      # Making the following Editors is a step-by-step process, adding one element at a time and
      # looking at what it does and what should be done next.
      # Get pertinent data segments
      header            = ' "~(?i)Today\'s Results - .+?<div style="padding-top:5px;">~==*END*OF*HEADER*" '
      race_summary      = ' "~(?i)Race [1-9].*?</font><br>~==" '
      data_segment      = ' "~(?i)<table border=0 width=100% cellpadding=0 cellspacing=0>(.|\n)*?</table>~==*END*OF*SEGMENT*" '
      Data_Filter = SE.SE (' <EAT> ' + header + race_summary + data_segment)

      # Some data items are empty. Fill them with a dash.
      mark_null_data = ' "~(?i)>\s* \s*</td>~=>-" '
      Null_Data_Marker = SE.SE (mark_null_data + ' " = " ')

      # Dump the tags
      eat_tags     = ' "~<(.|\n)*?>~=" '
      eat_comments = ' "~<!--(.|\n)*?-->~=" '
      Tag_Stripper = SE.SE (eat_tags + eat_comments + ' (13)= ')

      # Visual inspection is easier without all those tabs and empty lines
      Space_Deflator = SE.SE ('"~\n[\t ]+~=(10)" "~[\t ]+\n=(10)" | "~\n+~=(10)"')

      # Translating line breaks to tabs will make a tab-delimited CSV
      CSV_Maker = SE.SE ( '(10)=(9)' )

      _is_open = 1   # Block repeat calls



def close ():

   """Call close () if you want to free up memory"""

   global Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
   del Data_Filter, Null_Data_Marker, Tag_Stripper, Space_Deflator, CSV_Maker
   urllib.urlcleanup ()
   del urllib
   del SE



def get_horse_race_data (track, date):

   """tracks: 'bndg' or (the other one)
      date: e.g. '27/07/2006'
      The website shows partial data or none at all, probably depending on
      race schedules. The relevance of the date in the url is unclear.
   """

   def make_url (track, date):
      return 'http://www.aapracingandsports.com.au/racing/raceresultsonly.asp?storydate=%s&meetings=%s' % (date, track)

   page = urllib.urlopen (make_url (track, date))
   p = page.read ()
   page.close ()
   # When developing the program, don't get the file from the internet on
   # each call. Download it and read it from the hard disk.

   raw_data = Data_Filter (p)
   raw_data_marked = Null_Data_Marker (raw_data)
   raw_data_no_tags = Tag_Stripper (raw_data_marked)
   raw_data_compact = Space_Deflator (raw_data_no_tags)
   data = CSV_Maker (raw_data_compact)
   header, tables = data.split ('*END*OF*HEADER*', 1)
   records = tables.split ('*END*OF*SEGMENT*')
   return header, records [:-1]



def show_record (record, clip_summary = 100):

   """clip_summary: None will display it all"""

   # The records all have 55 fields.
   # These are the relevant indexes:
   SUMMARY                   =  0
   FIRST                     =  8
   FIRST_NSWTAB_WIN          =  9
   FIRST_NSWTAB_PLACE        = 10
   FIRST_TABCORP_WIN         = 11
   FIRST_TABCORP_PLACE       = 12
   FIRST_UNITAB_WIN          = 13
   FIRST_UNITAB_PLACE        = 14
   SECOND                    = 15
   SECOND_NSWTAB_PLACE       = 17
   SECOND_TABCORP_PLACE      = 19
   SECOND_UNITAB_PLACE       = 21
   THIRD                     = 22
   THIRD_NSWTAB_PLACE        = 23
   THIRD_TABCORP_PLACE       = 24
   THIRD_UNITAB_PLACE        = 25
   QUIN_NSWTAB_PLACE         = 28
   QUIN_TABCORP_PLACE        = 30
   QUIN_UNITAB_PLACE         = 32
   EXACTA_NSWTAB_PLACE       = 35
   EXACTA_TABCORP_PLACE      = 37
   EXACTA_UNITAB_PLACE       = 39
   TRI_NSWTAB_PLACE          = 41
   TRI_TABCORP_PLACE         = 42
   TRI_UNITAB_PLACE          = 43
   DDOUBLE_NSWTAB_PLACE      = 46
   DDOUBLE_TABCORP_PLACE     = 48
   DDOUBLE_UNITAB_PLACE      = 50
   SUB_SCR_NSW               = 52
   SUB_SCR_TABCORP           = 53
   SUB_SCR_UNITAB            = 54

   if clip_summary is None:
      print record [SUMMARY]
   else:
      print record [SUMMARY] [:clip_summary] + '...'
      print

   # Your specification:
   # Date   ( not important )          -> In url and summary of first record
   # Track................= Bendigo    -> In url and summary of first record
   # RaceNo............on web page     -> In summary (index of record + 1?)
   # Res1st...............2
   # Res2nd..............5
   # Res3rd..............1
   # Div1..................$4.60
   # DivPlc...............$2.40
   # Div2..................$2.70
   # Div3..................$1.30
   # DivQuin.............$23.00
   # DivTrif...............$120.70

   print 'Res1st  > %s' % record [FIRST]
   print 'Res2nd  > %s' % record [SECOND]
   print 'Res3rd  > %s' % record [THIRD]
   print 'Div1    > %s' % record [FIRST_NSWTAB_WIN]
   print 'DivPlc  > %s' % record [FIRST_NSWTAB_PLACE]
   print 'Div2    > %s' % record [SECOND_NSWTAB_PLACE]
   print 'Div3    > %s' % record [THIRD_NSWTAB_PLACE]
   print 'DivQuin > %s' % record [QUIN_NSWTAB_PLACE]
   print 'DivTrif > %s' % record [TRI_NSWTAB_PLACE]

   # Add others as you like from the list of index names above



def show_records (header, records, clip_summary = 100):

   print '\n%s\n' % header
   for record in records:
      show_record (record.split ('\t'), clip_summary)
      print '\n'


##########################################################################
#
# show_records (header, records, 74) displays:
#
# Today's Results - 27/07/2006 BENDIGO
#
# Race 1 results:Carlsruhe Roadhouse Mdn Plate $11,000 2yo Maiden 1400m Appr...
#
# Res1st  > 2
# Res2nd  > 5
# Res3rd  > 1
# Div1    > $4.60
# DivPlc  > $2.40
# Div2    > $2.70
# Div3    > $1.30
# DivQuin > $23.00
# DivTrif > $120.70
#
#
# Race 2 results:Gerard K. House P/L Mdn Plate $11,000 3yo Maiden 1400m Appr...
#
# Res1st  > 6
# Res2nd  > 7
# Res3rd  > 5
# Div1    > $3.50
# DivPlc  > $1.60
# Div2    > $2.60
# Div3    > $1.40
# DivQuin > $18.60
# DivTrif > $75.80
#
#
# Race 3 results:Richard Cambridge Printers Mdn $11,000 3yo Maiden 1400m Appr...
#
# Res1st  > 11
# Res2nd  > 12
# Res3rd  > 1
# Div1 ...
#
# ... etc
#

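For the curious, the pipeline of SE filters above can be roughly approximated with the standard 're' module. This is an illustrative sketch only, with a made-up HTML sample; the real filters also isolate the pertinent data segments first:

```python
import re

# Rough plain-re equivalent of the Tag_Stripper / Space_Deflator /
# CSV_Maker steps: dump tags, deflate whitespace, newlines to tabs.
def html_to_tab_record(fragment):
    text = re.sub(r'<!--(.|\n)*?-->', '', fragment)   # eat comments
    text = re.sub(r'<(.|\n)*?>', '\n', text)          # dump the tags
    text = re.sub(r'\n[\t ]+', '\n', text)            # leading blanks
    text = re.sub(r'[\t ]+\n', '\n', text)            # trailing blanks
    text = re.sub(r'\n+', '\n', text)                 # empty lines
    return text.strip().replace('\n', '\t')           # tab-delimited

sample = '<tr><td>2</td>\n<td>$4.60</td>\n<td> $2.40 </td></tr>'
result = html_to_tab_record(sample)
# result == '2\t$4.60\t$2.40'
```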




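The sentinel strings that Data_Filter appends ('*END*OF*HEADER*' and '*END*OF*SEGMENT*') are what make the final split in get_horse_race_data trivial. A toy illustration with made-up record text:

```python
# Toy data standing in for the filtered page text.
data = ('summary text*END*OF*HEADER*'
        'race 1 fields*END*OF*SEGMENT*'
        'race 2 fields*END*OF*SEGMENT*')

# Same split as in get_horse_race_data: one header, then one chunk
# per race; the empty piece after the last marker is dropped.
header, tables = data.split('*END*OF*HEADER*', 1)
records = tables.split('*END*OF*SEGMENT*')[:-1]
print(records)  # ['race 1 fields', 'race 2 fields']
```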
