how would you...?

Mensanator mensanator at aol.com
Sat May 17 01:08:56 EDT 2008


On May 16, 11:43 pm, Sanoski <Joshuajr... at gmail.com> wrote:
> I'm pretty new to programming. I've just been studying a few weeks off
> and on. I know a little, and I'm learning as I go. Programming is so
> much fun! I really wish I had gotten into it years ago, but
> here's my question. I have a long-term project in mind, and I want to
> know if it's feasible and how difficult it will be.
>
> There's an XML feed for my school that some other class designed. It's
> just a simple idea that lists all classes, room number, and the person
> with the highest GPA. The feed is set up like this. Each one of the
> following lines would also be a link to more information about the
> class, etc.
>
> Economics, Room 216, James Faker, 3.4
> Social Studies, Room 231, Brain Fictitious, 3.5
>
> etc, etc
>
> The student also has a picture reference that depicts his GPA based on
> the number. The picture is basically just a graph. I just want to
> write a program that uses the information on this feed.
>
> I want it to reach out to this XML feed, record each instance of the
> above format along with the picture reference of the highest GPA
> student, download it locally, and then be able to use that information
> in various ways. I figured I'll start by counting each instance. For
> example, the above would be 2 instances.
>
> Eventually, I want it to be able to cross reference data you've
> already downloaded, and be able to compare GPAs, etc. It would have a
> GUI and everything too, but I am trying to keep it simple right now,
> and just build onto it as I learn.
>
> So let's just say this. How do you grab information from the web,

Depends on the web page.

> in this case a feed,

Haven't tried that, just a simple CGI.
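
If it really is a plain XML feed, something along these lines
might work (untested -- the URL and tag names here are just
guesses, since I haven't seen your feed; ElementTree is in the
standard library as of Python 2.5):

import urllib2
from xml.etree import ElementTree

feed = urllib2.urlopen('http://school.example.edu/classes.xml')
tree = ElementTree.parse(feed)
for c in tree.findall('class'):            # hypothetical tag names
    print c.findtext('name'), c.findtext('room'), \
          c.findtext('student'), c.findtext('gpa')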

> and then use that in calculations?

The key is some type of structure, be it database records,
a list of lists, or whatever -- something you can iterate
through, sort, find the max element of, etc.
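
For instance, with your two sample lines held as a list of lists,
counting instances and finding the top GPA is just:

classes = [['Economics', 'Room 216', 'James Faker', 3.4],
           ['Social Studies', 'Room 231', 'Brain Fictitious', 3.5]]

print len(classes)                          # 2 instances
best = max(classes, key=lambda c: c[3])     # row with the highest GPA
print best[2], best[3]                      # Brain Fictitious 3.5

for c in sorted(classes, key=lambda c: c[3], reverse=True):
    print c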

> How would you
> implement such a project?

The example below uses BeautifulSoup. I'm posting it not
because it matches your problem, but to give you an idea of
the techniques involved.

> Would you save the information into a text file?

Possibly, but generally no. Text files aren't very useful
except as a data exchange medium.

> Or would you use something else?

Your application lends itself to a database approach.
Note in my example the database part of the code is disabled.
Not everyone has MS-Access on Windows.

> Should I study up on SQLite?

Yes. The MS-Access code I have can be easily changed to SQLite.
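
Something roughly like this (sqlite3 ships with Python 2.5; the
table and column names here are just made up for illustration):

import sqlite3

con = sqlite3.connect('classes.db')
cur = con.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS classes
               (class TEXT, room TEXT, student TEXT, gpa REAL)""")
cur.execute("INSERT INTO classes VALUES (?,?,?,?)",
            ('Economics', 'Room 216', 'James Faker', 3.4))
con.commit()
cur.execute("SELECT class, student, MAX(gpa) FROM classes")
print cur.fetchone()                 # class with the highest GPA
cur.close()
con.close()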

> Maybe I should study classes.

I don't know, but I've always gotten along without them.

> I'm just not sure. What would be the most effective technique?

Don't know that either as I've only done it once, as follows:

##  I was looking in my database of movie grosses I regularly copy
##  from the Internet Movie Database and noticed I was _only_ 120
##  weeks behind in my updates.
##
##  Ouch.
##
##  Copying a web page, pasting into a text file, running a perl
##  script to convert it into a csv file and manually importing it
##  into Access isn't so bad when you only have a couple to do at
##  a time. Still, it's a labor intensive process and 120 isn't
##  anything to look forward to.
##
##  But I abandoned perl years ago when I took up Python, so I
##  can use Python to completely automate the process now.
##
##  Just have to figure out how.
##
##  There are 3 main tasks: capture the web page, parse the web page
##  to extract the data and insert the data into the database.
##
##  But I only know how to do the last step, using the odbc tools
##  from win32,

####import dbi
####import odbc
import re

##  so I snoop around comp.lang.python to pick up some
##  hints and keywords on how to do the other two tasks.
##
##  Documentation on urllib2 was a bit vague, but got the web page
##  after only a couple missteps.

import urllib2

##  Unfortunately, HTMLParser remained beyond my grasp (is it
##  my imagination or is the quality of the examples in the
##  documentation inversely proportional to the subject
##  difficulty?)
##
##  Luckily, my bag of hints had a reference to Beautiful Soup,
##  whose web site proclaims:
##      Beautiful Soup is a Python HTML/XML parser
##      designed for quick turnaround projects like
##      screen-scraping.
##  Looks like just what I need, maybe I can figure it out after all.

from BeautifulSoup import BeautifulSoup

target_dates = [['4','6','2008','April']]

####con = odbc.odbc("IMDB")  # connect to MS-Access database
####cursor = con.cursor()

for d in target_dates:
  #
  # build url (with CGI parameters) from list of dates needing updating
  #
  the_year = d[2]
  the_date = '/'.join([d[0],d[1],d[2]])
  print '%10s scraping IMDB:'  % (the_date),
  the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',
                     d[1], '&month=', d[3]])
  req = urllib2.Request(url=the_url)
  f = urllib2.urlopen(req)
  www = f.read()
  #
  # ok, page captured. now make a BeautifulSoup object from it
  #
  soup = BeautifulSoup(www)
  #
  # that was easy, much more so than HTMLParser
  #
  # now, _all_ I have to do is figure out how to parse it
  #
  # ouch again. this is a lot harder than it looks in the
  # documentation. I need to get the data from cells of a
  # table nested inside another table and that's hard to
  # extrapolate from the examples showing how to find all
  # the comments on a web page.
  #
  # but this looks promising. if I grab all the table rows
  # (tr tags), each complete nested table is inside a cell
  # of the outer table (whose table tags are lost, but aren't
  # needed and whose absence makes extracting the nested
  # tables easier (when you do it the stupid way, but hey,
  # it works, so I'm sticking with it))
  #
  tr = soup.tr                          # table rows
  tr.extract()
  #
  # now, I only want the third nested table. how do I get it?
  # can't seem to get past the first one, should I be using
  # nextSibling or something? <scratches head...>
  #
  # but wait...I don't need the first two tables, so I can
  # simply extract and discard them. and since .extract()
  # CUTS the tables, after two extractions the table I want
  # IS the first one.
  #
  the_table = tr.find('table')          # discard
  the_table.extract()
  the_table = tr.find('table')          # discard
  the_table.extract()
  the_table = tr.find('table')          # weekly gross
  the_table.extract()
  #
  # of course, the data doesn't start in the first row,
  # there's formatting, header rows, etc. looks like it starts
  # in tr number [3]
  #
  ##  >>> the_table.contents[3].td
  ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
  #
  # and since tags always imply the first one, the above
  # is equivalent to
  #
  ##  >>> the_table.contents[3].contents[0]
  ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
  #
  # and since the title is the first of three cells, the
  # reporting year is
  #
  ##  >>> the_table.contents[3].contents[1]
  ##  <td> <a href="/Sections/Years/2001">2001</a> </td>
  #
  # finally, the 3rd cell must contain the gross
  #
  ##  >>> the_table.contents[3].contents[2]
  ##  <td align="RIGHT"> 259,674,120</td>
  #
  # but the contents of the first two cells are anchor tags.
  # to get the actual title string, I need the contents of the
  # contents. but that's not exactly what I want either,
  # I don't want a list, I need a string. and the string isn't
  # always in the same place in the list
  #
  # summarizing, what I need is
  #
  ##  print the_table.contents[3].contents[0].contents[0].contents,
  ##  print the_table.contents[3].contents[1].contents[1].contents,
  ##  print the_table.contents[3].contents[2].contents
  #
  # and that almost works, just a couple more tweaks and I can
  # shove it into the database

  parsed = []

  for rec in the_table.contents[3:]:
    the_rec_type = type(rec)              # some recs are NavigableStrings, skip
    if str(the_rec_type) == "<type 'instance'>":
      #
      # ok, got a real data row
      #
      TITLE_DATE = rec.contents[0].contents[0].contents   # a list inside a tuple
      #
      # and that means we still have to index the contents
      # of the contents of the contents of the contents by
      # adding [0][0] to TITLE_DATE
      #
      YEAR =  rec.contents[1].contents[1].contents        # ditto
      #
      # this won't go into the database, just used as a filter to grab
      # the records associated with the posting date and discard
      # the others (which should already be in the database)
      #
      GROSS = rec.contents[2].contents                    # just a list
      #
      # one other minor glitch: the film date is part of the title
      # (where it's of no use in the database), so it has to be pulled
      # out and put in a separate field
      #
#      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',
#                             str(TITLE_DATE[0][0]))
      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',
                             str(TITLE_DATE))
      #
      # which works 99% of the time. unfortunately, the IMDB
      # consistency is somewhat dubious. the date is _supposed_
      # to be at the end of the string, but sometimes it's not.
      # so, usually, there are only 5 groups, but you have to
      # allow for the fact that there may be 6
      #
      try:
        the_title = temp_title.group(1) + temp_title.group(5)
      except:
        the_title = temp_title.group(1)
      the_gross = str(GROSS[0])
      #
      # and for some unexplained reason, dates will occasionally
      # be 2001/I instead of 2001, so we want to discard the trailing
      # crap, if any
      #
      the_film_year = temp_title.group(3)[:4]
#      if str(YEAR[0][0])==the_year:
      if str(YEAR[0])==the_year:
        parsed.append([the_date,the_title,the_film_year,the_gross])

  print '%3d records found ' % (len(parsed))
  #
  # wow, now just have to insert all the update records directly
  # into the database...into a temporary table, of course. as I said,
  # IMDB consistency is somewhat dubious (such as changing the
  # spelling of the titles), so a QC check will be required inside Access
  #
####  if len(parsed)>0:
####    print '...inserting into database'
####    for p in parsed:
####      cursor.execute("""
####INSERT INTO imdweeks2 ( Date_reported, Title, Film_Date, Gross_to_Date )
####SELECT ?,?,?,?;""",p)
####  else:
####    print '...aborting, no records found'
####
####cursor.close()
####con.close()

  for p in parsed: print p

# and just because it works, doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)



##You should get this:
##
##  4/6/2008 scraping IMDB: 111 records found
##['4/6/2008', "[u'I Am Legend']", '2007', ' 256,386,216']
##['4/6/2008', "[u'National Treasure: Book of Secrets']", '2007', '
218,701,477']
##['4/6/2008', "[u'Alvin and the Chipmunks']", '2007', ' 216,873,487']
##['4/6/2008', "[u'Juno']", '2007', ' 142,545,706']
##['4/6/2008', "[u'Horton Hears a Who!']", '2008', ' 131,076,768']
##['4/6/2008', "[u'Bucket List, The']", '2007', ' 91,742,612']
##['4/6/2008', "[u'10,000 BC']", '2008', ' 89,349,915']
##['4/6/2008', "[u'Cloverfield']", '2008', ' 80,034,302']
##['4/6/2008', "[u'Jumper']", '2008', ' 78,762,148']
##['4/6/2008', "[u'27 Dresses']", '2008', ' 76,376,607']
##['4/6/2008', "[u'No Country for Old Men']", '2007', ' 74,273,505']
##['4/6/2008', "[u'Vantage Point']", '2008', ' 71,037,105']
##['4/6/2008', "[u'Spiderwick Chronicles, The']", '2008', '
69,872,230']
##['4/6/2008', '[u"Fool\'s Gold"]', '2008', ' 68,636,484']
##['4/6/2008', "[u'Hannah Montana/Miley Cyrus: Best of Both Worlds
Concert Tour']", '2008', ' 65,010,561']
##['4/6/2008', "[u'Step Up 2: The Streets']", '2008', ' 57,389,556']
##['4/6/2008', "[u'Atonement']", '2007', ' 50,921,738']
##['4/6/2008', "[u'21']", '2008', ' 46,770,173']
##['4/6/2008', "[u'College Road Trip']", '2008', ' 40,918,686']
##['4/6/2008', "[u'There Will Be Blood']", '2007', ' 40,133,435']
##['4/6/2008', "[u'Meet the Spartans']", '2008', ' 38,185,300']
##['4/6/2008', "[u'Meet the Browns']", '2008', ' 37,662,502']
##['4/6/2008', "[u'Deep Sea 3D']", '2006', ' 36,141,373']
##['4/6/2008', "[u'Semi-Pro']", '2008', ' 33,289,722']
##['4/6/2008', "[u'Definitely, Maybe']", '2008', ' 31,973,840']
##['4/6/2008', "[u'Eye, The']", '2008', ' 31,397,498']
##['4/6/2008', "[u'Great Debaters, The']", '2007', ' 30,219,326']
##['4/6/2008', "[u'Bank Job, The']", '2008', ' 26,804,821']
##['4/6/2008', "[u'Other Boleyn Girl, The']", '2008', ' 26,051,195']
##['4/6/2008', "[u'Drillbit Taylor']", '2008', ' 25,490,483']
##['4/6/2008', "[u'Magnificent Desolation: Walking on the Moon 3D']",
'2005', ' 23,283,158']
##['4/6/2008', "[u'Shutter']", '2008', ' 23,138,277']
##['4/6/2008', "[u'Never Back Down']", '2008', ' 23,080,675']
##['4/6/2008', "[u'Mad Money']", '2008', ' 20,648,442']
##['4/6/2008', "[u'Galapagos']", '1955', ' 17,152,405']
##['4/6/2008', "[u'Superhero Movie']", '2008', ' 16,899,661']
##['4/6/2008', "[u'Wild Safari 3D']", '2005', ' 16,550,933']
##['4/6/2008', "[u'Kite Runner, The']", '2007', ' 15,790,223']
##['4/6/2008', '[u"Nim\'s Island"]', '2008', ' 13,210,579']
##['4/6/2008', "[u'Leatherheads']", '2008', ' 12,682,595']
##['4/6/2008', "[u'Be Kind Rewind']", '2008', ' 11,028,439']
##['4/6/2008', "[u'Doomsday']", '2008', ' 10,955,425']
##['4/6/2008', "[u'Sea Monsters: A Prehistoric Adventure']", '2007', '
10,745,308']
##['4/6/2008', "[u'Miss Pettigrew Lives for a Day']", '2008', '
10,534,800']
##['4/6/2008', "[u'Môme, La']", '2007', ' 10,299,782']
##['4/6/2008', "[u'Penelope']", '2006', ' 9,646,154']
##['4/6/2008', "[u'Misma luna, La']", '2007', ' 8,959,462']
##['4/6/2008', "[u'Roving Mars']", '2006', ' 8,463,161']
##['4/6/2008', "[u'Stop-Loss']", '2008', ' 8,170,755']
##['4/6/2008', "[u'Ruins, The']", '2008', ' 8,003,241']
##['4/6/2008', "[u'Bella']", '2006', ' 7,776,080']
##['4/6/2008', "[u'U2 3D']", '2007', ' 7,348,105']
##['4/6/2008', "[u'Orfanato, El']", '2007', ' 7,159,147']
##['4/6/2008', "[u'In Bruges']", '2008', ' 6,831,761']
##['4/6/2008', "[u'Savages, The']", '2007', ' 6,571,599']
##['4/6/2008', "[u'Scaphandre et le papillon, Le']", '2007', '
5,990,075']
##['4/6/2008', "[u'Run Fatboy Run']", '2007', ' 4,430,583']
##['4/6/2008', "[u'Persepolis']", '2007', ' 4,200,980']
##['4/6/2008', "[u'Charlie Bartlett']", '2007', ' 3,928,412']
##['4/6/2008', "[u'Jodhaa Akbar']", '2008', ' 3,434,629']
##['4/6/2008', "[u'Fälscher, Die']", '2007', ' 2,903,370']
##['4/6/2008', "[u'Bikur Ha-Tizmoret']", '2007', ' 2,459,543']
##['4/6/2008', "[u'Shine a Light']", '2008', ' 1,488,081']
##['4/6/2008', "[u'Race']", '2008', ' 1,327,606']
##['4/6/2008', "[u'Funny Games U.S.']", '2007', ' 1,274,055']
##['4/6/2008', "[u'4 luni, 3 saptamâni si 2 zile']", '2007', '
1,103,315']
##['4/6/2008', "[u'Married Life']", '2007', ' 1,002,318']
##['4/6/2008', "[u'Diary of the Dead']", '2007', ' 893,192']
##['4/6/2008', "[u'Starting Out in the Evening']", '2007', ' 882,518']
##['4/6/2008', "[u'Dolphins and Whales 3D: Tribes of the Ocean']",
'2008', ' 854,304']
##['4/6/2008', "[u'Sukkar banat']", '2007', ' 781,954']
##['4/6/2008', "[u'Bonneville']", '2006', ' 471,679']
##['4/6/2008', "[u'Flawless']", '2007', ' 390,892']
##['4/6/2008', "[u'Paranoid Park']", '2007', ' 387,119']
##['4/6/2008', "[u'Teeth']", '2007', ' 321,732']
##['4/6/2008', "[u'Hammer, The']", '2007', ' 321,579']
##['4/6/2008', "[u'Priceless']", '2008', ' 320,131']
##['4/6/2008', "[u'Steep']", '2007', ' 259,840']
##['4/6/2008', "[u'Honeydripper']", '2007', ' 259,192']
##['4/6/2008', "[u'Snow Angels']", '2007', ' 255,147']
##['4/6/2008', "[u'Taxi to the Dark Side']", '2007', ' 231,743']
##['4/6/2008', "[u'Cheung Gong 7 hou']", '2008', ' 188,067']
##['4/6/2008', "[u'Ne touchez pas la hache']", '2007', ' 184,513']
##['4/6/2008', "[u'Sleepwalking']", '2008', ' 160,715']
##['4/6/2008', "[u'Chicago 10']", '2007', ' 149,456']
##['4/6/2008', "[u'Girls Rock!']", '2007', ' 92,636']
##['4/6/2008', "[u'Beaufort']", '2007', ' 87,339']
##['4/6/2008', "[u'Shelter']", '2007', ' 85,928']
##['4/6/2008', "[u'My Blueberry Nights']", '2007', ' 74,146']
##['4/6/2008', "[u'Témoins, Les']", '2007', ' 71,624']
##['4/6/2008', "[u'Mépris, Le']", '1963', ' 70,761']
##['4/6/2008', "[u'Singing Revolution, The']", '2006', ' 66,482']
##['4/6/2008', "[u'Chop Shop']", '2007', ' 58,858']
##['4/6/2008', '[u"Chansons d\'amour, Les"]', '2007', ' 58,577']
##['4/6/2008', "[u'Praying with Lior']", '2007', ' 57,325']
##['4/6/2008', "[u'Yihe yuan']", '2006', ' 57,155']
##['4/6/2008', "[u'Casa de Alice, A']", '2007', ' 53,700']
##['4/6/2008', "[u'Blindsight']", '2006', ' 51,256']
##['4/6/2008', "[u'Boarding Gate']", '2007', ' 37,107']
##['4/6/2008', "[u'Voyage du ballon rouge, Le']", '2007', ' 35,222']
##['4/6/2008', "[u'Bill']", '2007', ' 35,201']
##['4/6/2008', "[u'Mio fratello è figlio unico']", '2007', '
34,138']
##['4/6/2008', "[u'Chapter 27']", '2007', ' 32,602']
##['4/6/2008', "[u'Meduzot']", '2007', ' 25,352']
##['4/6/2008', "[u'Shotgun Stories']", '2007', ' 25,346']
##['4/6/2008', "[u'Sconosciuta, La']", '2006', ' 18,569']
##['4/6/2008', "[u'Imaginary Witness: Hollywood and the Holocaust']",
'2004', ' 18,475']
##['4/6/2008', "[u'Irina Palm']", '2007', ' 14,214']
##['4/6/2008', "[u'Naissance des pieuvres']", '2007', ' 7,418']
##['4/6/2008', "[u'Four Letter Word, A']", '2007', ' 6,017']
##['4/6/2008', "[u'Tuya de hun shi']", '2006', ' 2,619']



