my first screen scraper

mensanator at aol.com
Sat Dec 1 23:25:26 EST 2007


##  I was looking in my database of movie grosses I regularly copy
##  from the Internet Movie Database and noticed I was _only_ 120
##  weeks behind in my updates.
##
##  Ouch.
##
##  Copying a web page, pasting into a text file, running a perl
##  script to convert it into a csv file and manually importing it
##  into Access isn't so bad when you only have a couple to do at
##  a time. Still, it's a labor-intensive process and 120 isn't
##  anything to look forward to.
##
##  But I abandoned perl years ago when I took up Python, so I
##  can use Python to completely automate the process now.
##
##  Just have to figure out how.
##
##  There are 3 main tasks: capture the web page, parse the web page
##  to extract the data, and insert the data into the database.
##
##  But I only know how to do the last step, using the odbc tools
##  from win32,

import dbi
import odbc
import re

##  so I snoop around comp.lang.python to pick up some
##  hints and keywords on how to do the other two tasks.
##
##  Documentation on urllib2 was a bit vague, but got the web page
##  after only a couple of mis-steps.

import urllib2

##  Unfortunately, HTMLParser remained beyond my grasp (is it
##  my imagination or is the quality of the examples in the
##  documentation inversely proportional to the subject
##  difficulty?)
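##
##  (for the record, here's roughly what the HTMLParser route seems
##  to demand. the CellGrabber class and the page_source name are
##  mine and the sketch is untested, but you end up tracking state
##  by hand across callbacks:
##
##      from HTMLParser import HTMLParser
##
##      class CellGrabber(HTMLParser):
##          def __init__(self):
##              HTMLParser.__init__(self)
##              self.in_td = False
##              self.cells = []
##          def handle_starttag(self, tag, attrs):
##              if tag == 'td':
##                  self.in_td = True
##                  self.cells.append('')
##          def handle_data(self, data):
##              if self.in_td:
##                  self.cells[-1] += data
##          def handle_endtag(self, tag):
##              if tag == 'td':
##                  self.in_td = False
##
##      grabber = CellGrabber()
##      grabber.feed(page_source)        # page_source = the raw HTML
##
##  and even then I'd still have to figure out which cells belong to
##  which nested table)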
##
##  Luckily, my bag of hints had a reference to Beautiful Soup,
##  whose web site proclaims:
##      Beautiful Soup is a Python HTML/XML parser
##      designed for quick turnaround projects like
##      screen-scraping.
##  Looks like just what I need, maybe I can figure it out after all.

from BeautifulSoup import BeautifulSoup

target_dates = [['11','24','2006','November']]
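# each date is [month, day, year, month name]; the url wants the day
# number and month name, the log and database want mm/dd/yyyy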

con = odbc.odbc("IMDB")  # connect to MS-Access database
cursor = con.cursor()

for d in target_dates:
  #
  # build url (with CGI parameters) from list of dates needing updating
  #
  the_year = d[2]
  the_date = '/'.join([d[0],d[1],d[2]])
  print '%10s scraping IMDB:'  % (the_date),
  the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',d[1],'&month=',d[3]])
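  #
  # for the sample date above, that works out to something like
  #
  ##  http://www.imdb.com/BusinessThisDay?day=24&month=November
  #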
  req = urllib2.Request(url=the_url)
  f = urllib2.urlopen(req)
  www = f.read()
  #
  # ok, page captured. now make a BeautifulSoup object from it
  #
  soup = BeautifulSoup(www)
  #
  # that was easy, much more so than HTMLParser
  #
  # now, _all_ I have to do is figure out how to parse it
  #
  # ouch again. this is a lot harder than it looks in the
  # documentation. I need to get the data from cells of a
  # table nested inside another table and that's hard to
  # extrapolate from the examples showing how to find all
  # the comments on a web page.
  #
  # but this looks promising. if I grab all the table rows
  # (tr tags), each complete nested table is inside a cell
  # of the outer table (whose table tags are lost, but aren't
  # needed and whose absence makes extracting the nested
  # tables easier (when you do it the stupid way, but hey,
  # it works, so I'm sticking with it))
  #
  tr = soup.tr                          # table rows
  tr.extract()
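  #
  # (to be precise, soup.tr is just the _first_ tr on the page, and
  # extract() lifts it out of the soup with everything nested inside
  # it, which is why the finds below can work on tr alone)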
  #
  # now, I only want the third nested table. how do I get it?
  # can't seem to get past the first one, should I be using
  # NextSibling or something? <scratches head...>
  #
  # but wait...I don't need the first two tables, so I can
  # simply extract and discard them. and since .extract()
  # CUTS the tables, after two extractions the table I want
  # IS the first one.
  #
  the_table = tr.find('table')          # discard
  the_table.extract()
  the_table = tr.find('table')          # discard
  the_table.extract()
  the_table = tr.find('table')          # weekly gross
  the_table.extract()
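  #
  # (in hindsight, I suspect findAll would have skipped the whole
  # cut-twice dance: grab every nested table in one go and index the
  # third one. untested, but probably something like
  #
  ##  the_table = tr.findAll('table')[2]   # weekly gross
  #
  # the extract-and-discard version above works, so it stays)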
  #
  # of course, the data doesn't start in the first row,
  # there's formatting, header rows, etc. looks like it starts
  # in tr number [3]
  #
  ##  >>> the_table.contents[3].td
  ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
  #
  # and since tags always imply the first one, the above
  # is equivalent to
  #
  ##  >>> the_table.contents[3].contents[0]
  ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
  #
  # and since the title is the first of three cells, the
  # reporting year is
  #
  ##  >>> the_table.contents[3].contents[1]
  ##  <td> <a href="/Sections/Years/2001">2001</a> </td>
  #
  # finally, the 3rd cell must contain the gross
  #
  ##  >>> the_table.contents[3].contents[2]
  ##  <td align="RIGHT"> 259,674,120</td>
  #
  # but the contents of the first two cells are anchor tags.
  # to get the actual title string, I need the contents of the
  # contents. but that's not exactly what I want either,
  # I don't want a list, I need a string. and the string isn't
  # always in the same place in the list
  #
  # summarizing, what I need is
  #
  ##  print the_table.contents[3].contents[0].contents[0].contents,
  ##  print the_table.contents[3].contents[1].contents[1].contents,
  ##  print the_table.contents[3].contents[2].contents
  #
  # and that almost works, just a couple more tweaks and I can
  # shove it into the database
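  #
  # (another thing I haven't tried: BeautifulSoup's find(text=True)
  # is supposed to return the first text node under a tag no matter
  # how deeply it's nested, which might dodge the index guessing,
  # e.g. something like
  #
  ##  the_table.contents[3].contents[0].find(text=True)
  #
  # but the contents-of-contents chain does the job, so on we go)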

  parsed = []

  for rec in the_table.contents[3:]:
    the_rec_type = type(rec)                      # some rec are NavigableStrings, skip
    if str(the_rec_type) == "<type 'instance'>":
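      #
      # (a cleaner check would probably be isinstance() against the
      # Tag class that BeautifulSoup exports. untested, but roughly
      #
      ##  from BeautifulSoup import Tag
      ##  if isinstance(rec, Tag):
      #
      # the string comparison above works for me, so it stays)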
      #
      # ok, got a real data row
      #
      TITLE_DATE = rec.contents[0].contents[0].contents   # a list inside a tuple
      #
      # and that means we still have to index the contents
      # of the contents of the contents of the contents by
      # adding [0][0] to TITLE_DATE
      #
      YEAR =  rec.contents[1].contents[1].contents        # ditto
      #
      # this won't go into the database, just used as a filter to grab
      # the records associated with the posting date and discard
      # the others (which should already be in the database)
      #
      GROSS = rec.contents[2].contents                    # just a list
      #
      # one other minor glitch, that film date is part of the title
      # (which is of no use in the database), so it has to be pulled out
      # and put in a separate field
      #
      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE[0][0]))
      #
      # which works 99% of the time. unfortunately, the IMDB
      # consistency is somewhat dubious. the date is _supposed_
      # to be at the end of the string, but sometimes it's not.
      # so, usually, there are only 5 groups, but you have to
      # allow for the fact that there may be 6
      #
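      #
      # for a well-behaved title, the groups should come out roughly
      # like this (not verified against every oddball title):
      #
      ##  'How the Grinch Stole Christmas (2000)'
      ##     group(1) -> 'How the Grinch Stole Christmas'
      ##     group(3) -> '2000'
      ##     group(5) -> ''
      #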
      try:
        the_title = temp_title.group(1) + temp_title.group(5)
      except:
        the_title = temp_title.group(1)
      the_gross = str(GROSS[0])
      #
      # and for some unexplained reason, dates will occasionally
      # be 2001/I instead of 2001, so we want to discard the trailing
      # crap, if any
      #
      the_film_year = temp_title.group(3)[:4]
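      # (so a stray '2001/I' gets trimmed back to '2001')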
      if str(YEAR[0][0])==the_year:
        parsed.append([the_date,the_title,the_film_year,the_gross])

  print '%3d records found ' % (len(parsed)),
  #
  # wow, now just have to insert all the update records directly
  # into the database...into a temporary table, of course. as I said,
  # IMDB consistency is somewhat dubious (such as changing the spelling
  # of the titles), so a QC check will be required inside Access
  #
  if len(parsed)>0:
    print '...inserting into database'
    for p in parsed:
      cursor.execute("""
INSERT INTO imdweeks2 ( Date_reported, Title, Film_Date, Gross_to_Date )
SELECT ?,?,?,?;""",p)
  else:
    print '...aborting, no records found'

cursor.close()
con.close()

# and just because it works, doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)

##  5 records found for 11/26/2006... ...inserting into database
##
##  strange, only 5 films were in release over Thanksgiving weekend?
##
##  of course not, IMDB screwed up (holidays being chronic problems
##  for IMDB). for some reason, the others were posted on Friday
##  instead of Sunday. at least that can be corrected by scraping
##  the page for 11/24.
##
##    mis-reported: records found on 11/24
##  11/24/2006 scraping IMDB:  88 records found  ...inserting into database

##
##  other data is just plain missing (don't they do QA/QC at IMDB?
##  how hard is it to figure out that each week has 60-120 records?)
##
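##  (note to self: next pass, bolt a sanity check onto the script
##  right after the parse so the short weeks flag themselves.
##  untested, but something along the lines of
##
##      if parsed and not 60 <= len(parsed) <= 120:
##          print '...record count looks suspicious'
##
##  beats eyeballing the log for gaps like the ones below)
##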
##  4 records found for 3/25/2007... ...inserting into database
##    records appear to be missing
##  6 records found for 4/8/2007... ...inserting into database
##    records appear to be missing
##  17 records found for 4/15/2007... ...inserting into database
##    records appear to be missing
##  4 records found for 5/13/2007... ...inserting into database
##    records appear to be missing
##  1 records found for 7/15/2007... ...inserting into database
##    records appear to be missing
##  10/14/2007 scraping IMDB: 128 records found  ...inserting into database
##  10/21/2007 scraping IMDB: 130 records found  ...inserting into database
##  10/28/2007 scraping IMDB:   1 records found  ...inserting into database
##    records appear to be missing
##   11/4/2007 scraping IMDB: 105 records found  ...inserting into database
##  11/11/2007 scraping IMDB:   2 records found  ...inserting into database
##    records appear to be missing
##  11/18/2007 scraping IMDB: 101 records found  ...inserting into database

##  other errors are legitimate. last week's grosses simply
##  haven't been posted yet. and they sometimes are weeks
##  behind. bet that doesn't happen with IMDBPro ;-)
##  11/25/2007 scraping IMDB:   0 records found  ...aborting, no records found

##  so, next update session I start with 11/25

##  there, that wasn't so bad. only took about 40 minutes
##  to scrape 120 or so web pages. that's one of the things
##  I like about Python. even with documentation frustration,
##  I've got what I need at the end of the day


