Parsing Baseball Stats

Ankit ankitdesai at gmail.com
Wed Jul 26 08:41:12 EDT 2006


Frederic,

Thanks for posting the solution. I used the original solution you
posted and it worked beautifully.

Paul,

I understand your concern for the site's TOS. Although this may not
mean anything, the reason I wanted this "parser" was that I wanted
to get the Advanced and Translated Stats for personal use. I don't
have any commercial motives; playing with baseball stats is my hobby.
The site does allow one to download stuff for personal use, which I
abide by. Also, I am only looking to get the aforementioned stats for
some players. The site has player pages for over 16,000 players. I
think it would be unfair to the site owners if I downloaded all
16,000 players using the script. In the end, they might just move the
stats into their premium package (not free) and then I would be really
screwed.

So, I understand your concerns and thank you for posting them.

Ankit

Anthra Norell wrote:
> ----- Original Message -----
> From: "Paul McGuire" <ptmcg at austin.rr._bogus_.com>
> Newsgroups: comp.lang.python
> To: <python-list at python.org>
> Sent: Wednesday, July 26, 2006 1:01 AM
> Subject: Re: Parsing Baseball Stats
>
>
> > "Anthra Norell" <anthra.norell at tiscalinet.ch> wrote in message
> > news:mailman.8551.1153861590.27775.python-list at python.org...
> > >
>           snip
> > >
> > Frederic -
> >
> > HTML parsing is one of those slippery slopes - or perhaps "tar babies" might
> > be a better metaphor - that starts out as a simple problem, but then one
> > exception after the next drags the solution out for daaaays.  Probably once
> > or twice a week, there is a posting here from someone trying to extract data
> > from a website, usually something like trying to pull the href's out of some
>
>           snip
>
> > So what started out as a little joke (microscopic, even) has eventually
> > touched a nerve, so thanks and apologies to those who have read this whole
> > mess.  Frederic, SE looks like a killer - may it become the next regexp!
> >
> > -- Paul
> >
>
> Paul,
>
> A year ago or so someone posted a call for ideas on encoding passwords for his own private use. I suggested a solution using
> python's random number generator and was immediately reminded by several knowledgeable people, quite sharply by some, that the
> random number generator was not to be used for cryptographic applications, since the doc specifically said so. I was also given good
> advice on what to read.
>       I thought that my solution was good, if not by the catechism, then by the requirements of the OP's problem which I considered
> to be the issue. I hoped the OP would come back with his opinion, but he didn't.
>       Not then and there. He did some time later, off list, telling me privately that he had incorporated my solution with some
> adaptations and that it was exactly what he had been looking for.
>
> So let me pursue this on two lines: A) your response and B) the issue.
>
> A) I thank you for the considerable time you must have taken to explain pyparsing in such detail. I didn't know you're the author.
> Congratulations! It certainly looks very professional. I have no doubt that it is an excellent and powerful tool.
>       Thanks also for your explanation of the TOS concept. It isn't alien to me and I have no problem with it. But I don't believe
> it means that one should voluntarily argue against one's own freedom, barking at oneself with the voice of the legal watchdogs out
> there that would restrict our freedom preemptively, getting a tug on the leash for excessive zeal but a pat on the head nonetheless.
> We have little cause to assume that the OP is setting up a baseball information service and have much cause to assume that he is
> not. So let us reserve the benefit of the doubt because this is what the others do. And work by plausible assumption--necessarily,
> because the realm of certainty is too small an action base.
>       SE is not a parser. It is a stream editor. I believe it fills a gap, handling a certain kind of problem very gracefully while
> being particularly easy to use. Your spontaneous reaction of horror was the consequence of a misinterpretation. The Tag_Stripper's
> argument ('"~<.*?>~= " "~<[^>]*~=" "~[^<]*>~="') is not the frightful incarnation of a novel, yet more arcane regular expression
> syntax. It is simply a string consisting of three very simple expressions: '<.*?>', '<[^>]*' and '[^<]*>'. They could also be
> written as or-ed alternatives: '<.*?>|<[^>]*|[^<]*>'. The tildes brace the regex to identify it as such. The equal sign says replace
> what precedes with what follows. Nothing happens to follow, which means replace it with nothing, which means delete it (tags).
> That's all. SE allows--encourages--breaking a complex search down into any number of simple components.
>       (Having just said 'easy to use' I notice a mistake. I correct it below in section C.)
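
For readers without SE at hand, the or-ed form quoted above can be tried with Python's standard re module. This is only a rough sketch under stated assumptions: it uses re.sub instead of SE, it works on one line at a time, and the function name strip_tags_per_line and the sample lines are made up for illustration.

```python
import re

# The three simple expressions from the message, or-ed together.
# At each position the alternatives are tried from left to right.
TAG_PARTS = re.compile(r'<.*?>|<[^>]*|[^<]*>')

def strip_tags_per_line(line):
    """Delete complete tags and tag fragments from a single line."""
    return TAG_PARTS.sub('', line)

# Hypothetical sample: the <a ...> tag is split across two lines.
sample_lines = ['<td>19</td> <a href="x"', 'class="y">1914</a>']
cleaned = [strip_tags_per_line(line) for line in sample_lines]
print(cleaned)
```

A stray '<' or '>' in ordinary text would also trigger the second or third alternative, so this per-line form can delete text as well as tags.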
>
> B) I would welcome the OP's opinion.
>
> Regards
>
> Frederic
>
>
> C) Correction: The second and third expressions were meant to catch tags spanning lines. There weren't any such tags and so the
> expressions were useless--and inoffensive too: the second one, as a matter of fact, could also delete text. The Tag Stripper should
> be defined like this:
>
> Tag_Stripper = ('"~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~="')
>
> It now deletes tags even if they span lines and it incorporates a second definition that deletes comments which, as you made me
> aware, may contain tags. I now have to run the whole file through this before I look at the lines.
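
The same idea, whole text in and stripped text out, can be sketched with the standard re module for anyone who wants to experiment without SE. The assumptions here are mine: re.sub stands in for SE, the name strip_markup and the sample string are invented, and comments are removed in a separate first pass because two sequential substitutions are order-sensitive. Nothing is claimed about how SE itself orders its definitions.

```python
import re

# re.DOTALL lets '.' match newlines, so a tag or comment may span lines.
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
TAG = re.compile(r'<.*?>', re.DOTALL)

def strip_markup(html_text):
    """Return html_text with comments and tags removed (rough sketch)."""
    # Comments go first, so that tags inside a comment cannot end it early.
    without_comments = COMMENT.sub('', html_text)
    return TAG.sub('', without_comments)

# Invented sample with a tag spanning two lines and a comment holding a tag.
sample = ('Actual <b\nclass="h">Pitching</b>'
          '<!-- <i>hidden</i>\n note --> Statistics')
print(strip_markup(sample))
```

This is pattern matching, not HTML parsing, so unusual markup (a '>' inside an attribute value, for instance) can still trip it up.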
>
> import urllib      # Python 2 standard library
> import StringIO    # Python 2 standard library
>
> def get_statistics (name_of_player):
>
>    # Section headings to collect, each mapped to its list of lines.
>    statistics = {
>      'Actual Pitching Statistics'   : [],
>      'Advanced Pitching Statistics' : [],
>    }
>
>    url = 'http://www.baseballprospectus.com/dt/%s.shtml' % name_of_player
>    htm_page = urllib.urlopen (url)
>    # Strip tags and comments from the whole page, then read it line by line.
>    lines = StringIO.StringIO (Tag_Stripper (htm_page.read ()))
>    htm_page.close ()
>    current_list = None
>    for line in lines:
>       line = line.strip ()
>       if line == '':
>          continue
>       if 'Statistics' in line:  # These are the section headings.
>          if line in statistics:
>             current_list = statistics [line]
>             current_list.append (line)
>          else:
>             current_list = None   # A section we are not collecting.
>       else:
>          if current_list is not None:
>             current_list.append (CSV_Maker (line))
>
>    return statistics
>
>
> show_statistics (statistics) displays this tab-delimited CSV:
>
> Advanced Pitching Statistics
> AGE YEAR TEAM XIP RA DH DR DW NRA RAA PRAA PRAR DERA NRA RAA PRAA PRAR DERA STF
> 19 1914 BOS-A 25.3 4.70 -2 3 1 5.75 -4 -5 -2 6.15 6.19 -5 -5 -2 6.36 -25
> 20 1915 BOS-A 225.3 3.31 -12 3 2 4.01 12 4 45 4.33 4.25 6 1 42 4.44 12
> 21 1916 BOS-A 318.2 2.31 -32 -8 0 3.19 46 41 101 3.35 3.30 43 39 99 3.41 24
> 22 1917 BOS-A 336.5 2.56 -20 -7 1 3.49 38 23 83 3.88 3.72 29 20 80 3.96 13
> 23 1918 BOS-A 171.6 2.76 -16 5 0 3.80 13 6 34 4.20 4.16 6 3 31 4.36 3
> 24 1919 BOS-A 129.4 3.98 4 -16 2 4.63 -2 -2 19 4.61 4.79 -4 -3 17 4.70 -6
> 25 1920 NY_-A 6.4 9.00 -1 3 1 8.64 -3 -3 -3 8.96 8.95 -3 -3 -3 9.14 -35
> 26 1921 NY_-A 13.2 10.00 2 0 1 9.16 -7 -7 -5 9.36 9.61 -8 -8 -5 9.65 -41
> 35 1930 NY_-A 8.8 3.00 1 -2 0 2.84 2 2 4 2.57 3.07 1 2 3 2.66 13
> 38 1933 NY_-A 8.8 5.00 1 -1 0 5.01 -1 0 0 4.59 5.27 -1 0 0 4.73 -22
> 1243.5 2.95 -76 -22 8 3.78 96 59 275 4.07 3.95 65 45 262 4.17 10
>
> Actual Pitching Statistics
> AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO
> 19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0
> 20 1915 BOS-A 18 8 0 2.44 32 28 874 217.7 166 80 59 3 85 112 6 0 9 1 16 1
> 21 1916 BOS-A 23 12 1 1.75 44 41 1272 323.7 230 83 63 0 118 170 8 0 3 1 23 9
> 22 1917 BOS-A 24 13 2 2.01 41 38 1277 326.3 244 93 73 2 108 128 11 0 5 0 35 6
> 23 1918 BOS-A 13 7 0 2.22 20 19 660 166.3 125 51 41 1 49 40 2 0 3 1 18 1
> 24 1919 BOS-A 9 5 1 2.97 17 15 570 133.3 148 59 44 2 58 30 2 0 5 1 12 0
> 25 1920 NY_-A 1 0 0 4.50 1 1 17 4.0 3 4 2 0 2 0 0 0 0 0 0 0
> 26 1921 NY_-A 2 0 0 9.00 2 1 49 9.0 14 10 9 1 9 2 0 0 0 0 0 0
> 35 1930 NY_-A 1 0 0 3.00 1 1 39 9.0 11 3 3 0 2 3 0 0 0 0 1 0
> 38 1933 NY_-A 1 0 0 5.00 1 1 42 9.0 12 5 5 0 3 0 0 0 0 0 1 0
> 94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17
>
> (The last line remains to be shifted three columns to the right.)
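
The remaining shift could be handled with a few lines of plain Python. The sketch below is not part of the posted script: align_rows is a hypothetical helper, and it assumes that a short row is missing only its leading columns (AGE, YEAR and TEAM in the totals row). It splits on whitespace, as the output appears here. It returns lists rather than dictionaries because the Advanced header repeats some column names.

```python
def align_rows(header_line, data_lines):
    """Left-pad rows that have fewer fields than the header."""
    header = header_line.split()
    aligned = []
    for line in data_lines:
        fields = line.split()
        # A totals row has no AGE, YEAR or TEAM, so it comes up short;
        # empty leading fields push its values under the right headings.
        padding = [''] * (len(header) - len(fields))
        aligned.append(padding + fields)
    return header, aligned

# Header, first season and totals row copied from the output above.
header, rows = align_rows(
    'AGE YEAR TEAM W L SV ERA G GS TBF IP H R ER HR BB SO HBP IBB WP BK CG SHO',
    ['19 1914 BOS-A 2 1 0 3.91 4 3 96 23.0 21 12 10 1 7 3 0 0 0 0 1 0',
     '94 46 4 2.28 163 148 4896 1221.3 974 400 309 10 441 488 29 0 25 4 107 17'])

for row in rows:
    print('\t'.join(row))
```

With the padding in place, a value can be looked up by heading, for example rows[1][header.index('ERA')] for the career ERA.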



