Parsing Baseball Stats

Wed Jul 26 18:55:39 EDT 2006

Hi.

  The web page you need to parse is not very well-formed (I think), but
that's no problem. Perhaps the best option is to locate the portion of
HTML you want, in this case from "<h3 class="cardsect">Actual Pitching
Statistics </h3><pre>" to "</pre>". Between these two markers you have
a few entries like this one: " 19 <a
href=http://www.baseballprospectus.com/dt//1914BOS-A.shtml>1914
BOS-A</a>   2   1   0   3.91    4    3    96   23.0   21   12   10    1
   7    3   0   0   0   0   1   0".
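
A rough sketch of that first step, using plain string searches. The
"page" string is a made-up stand-in for the downloaded HTML, and
extract_block is just a name I picked for illustration, so adjust both
to the real page:

```python
# Hypothetical stand-in for the downloaded page; the real HTML will differ.
page = (
    '<html><body><h3 class="cardsect">Actual Pitching Statistics </h3><pre>\n'
    ' 19 <a href=http://www.baseballprospectus.com/dt//1914BOS-A.shtml>'
    '1914 BOS-A</a>   2   1   0   3.91\n'
    '</pre><h3 class="cardsect">Other section</h3><pre>ignored</pre>'
    '</body></html>'
)

start_marker = '<h3 class="cardsect">Actual Pitching Statistics </h3><pre>'
end_marker = '</pre>'


def extract_block(html):
    """Return the text between start_marker and the next end_marker, or None."""
    start = html.find(start_marker)
    if start == -1:
        return None  # section heading not found
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None  # <pre> block never closed
    return html[start:end]


block = extract_block(page)
# Keep only the non-empty lines; each one is a single stats entry.
rows = [line for line in block.splitlines() if line.strip()]
print(rows)
```

Each element of rows is then one entry like the one quoted above.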

Here is a small piece of code using regular expressions (the re module)
that handles one such entry. I think it will help you develop the rest
of the app.

import re

# One sample entry from the <pre> block (a single line in the page source)
data = (' 19 <a href=http://www.baseballprospectus.com/dt//1914BOS-A.shtml>'
        '1914 BOS-A</a>   2   1   0   3.91    4    3    96   23.0   21   12'
        '   10    1    7    3   0   0   0   0   1   0')
pt = re.compile("(<a.*?>|</a>)")  # matches the opening and closing <a> tags
data1 = pt.sub("", data)  # now data1 doesn't contain any HTML tags
pt = re.compile(" +")  # matches each run of one or more spaces
data2 = pt.sub("-", data1)  # replace each run of spaces with a single "-"
arrange_data = data2.split("-")  # this makes a list with the data

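After these few statements you'll have a list with the data you need,
like the one below. Note that the first element is an empty string
(the entry starts with a space) and that "BOS-A" ends up as 'BOS' and
'A', because "-" is also used as the separator: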
['', '19', '1914', 'BOS', 'A', '2', '1', '0', '3.91', '4', '3', '96',
'23.0', '21', '12', '10', '1', '7', '3', '0', '0',
'0', '0', '1', '0']
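
If you would rather keep "BOS-A" in one piece and get numbers instead
of strings, one possible variation is to split on whitespace directly.
This is only a sketch: parse_row is a name I made up, the tag pattern
assumes the only tags in an entry are <a ...> and </a>, and the sample
entry is shortened:

```python
import re

TAG = re.compile(r"</?a\b[^>]*>")  # an opening or closing <a> tag


def parse_row(row):
    """Strip <a> tags, split on whitespace, and convert numeric fields."""
    fields = TAG.sub("", row).split()
    parsed = []
    for field in fields:
        try:
            parsed.append(int(field))
        except ValueError:
            try:
                parsed.append(float(field))
            except ValueError:
                parsed.append(field)  # not a number, e.g. "BOS-A"
    return parsed


row = (' 19 <a href=http://www.baseballprospectus.com/dt//1914BOS-A.shtml>'
       '1914 BOS-A</a>   2   1   0   3.91    4    3    96   23.0')
print(parse_row(row))
```

Calling split() with no argument also drops the empty first element,
since leading and repeated whitespace are ignored.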

I think this is a good start for you.

Tell me if you can solve the problem with this or if you need more
help.

Bye