how would you...?
Mensanator
mensanator at aol.com
Sat May 17 01:08:56 EDT 2008
On May 16, 11:43 pm, Sanoski <Joshuajr... at gmail.com> wrote:
> I'm pretty new to programming. I've just been studying a few weeks off
> and on. I know a little, and I'm learning as I go. Programming is so
> much fun! I really wish I would have gotten into it years ago, but
> here's my question. I have a long-term project in mind, and I want to
> know if it's feasible and how difficult it will be.
>
> There's an XML feed for my school that some other class designed. It's
> just a simple idea that lists all classes, room number, and the person
> with the highest GPA. The feed is set up like this. Each one of the
> following lines would also be a link to more information about the
> class, etc.
>
> Economics, Room 216, James Faker, 3.4
> Social Studies, Room 231, Brain Fictitious, 3.5
>
> etc, etc
>
> The student also has a picture reference that depicts his GPA based on
> the number. The picture is basically just a graph. I just want to
> write a program that uses the information on this feed.
>
> I want it to reach out to this XML feed, record each instance of the
> above format along with the picture reference of the highest GPA
> student, download it locally, and then be able to use that information
> in various ways. I figured I'll start by counting each instance. For
> example, the above would be 2 instances.
>
> Eventually, I want it to be able to cross reference data you've
> already downloaded, and be able to compare GPA's, etc. It would have a
> GUI and everything too, but I am trying to keep it simple right now,
> and just build onto it as I learn.
>
> So let's just say this. How do you grab information from the web,
Depends on the web page.
> in this case a feed,
Haven't tried that, just a simple CGI.
> and then use that in calculations?
The key is some type of structure, be it database records,
a list of lists, or whatever. Something you can iterate
through, sort, find the max element of, etc.
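For example, here's a minimal sketch of that idea (Python 3; the class and
student names are the two lines from the example feed above):

```python
# A list of lists holding the feed data: [class, room, student, gpa].
records = [
    ['Economics', 'Room 216', 'James Faker', 3.4],
    ['Social Studies', 'Room 231', 'Brain Fictitious', 3.5],
]

# Counting instances is just the length of the list.
print(len(records))                           # 2

# Sorting and finding the max element use the GPA field as the key.
by_gpa = sorted(records, key=lambda rec: rec[3])
top = max(records, key=lambda rec: rec[3])
print(top[2])                                 # Brain Fictitious
```

Once the feed data is in a structure like this, the GUI and the
cross-referencing you mention are just more operations on the same list.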
> How would you
> implement such a project?
The example below uses BeautifulSoup. I'm posting it not
because it matches your problem, but to give you an idea of
the techniques involved.
> Would you save the information into a text file?
Possibly, but generally no. Text files aren't very useful
except as a data exchange medium.
> Or would you use something else?
Your application lends itself to a database approach.
Note in my example the database part of the code is disabled.
Not everyone has MS-Access on Windows.
> Should I study up on SQLite?
Yes. The MS-Access code I have can be easily changed to SQLite.
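A minimal sketch of what that looks like with Python's bundled sqlite3
module (the table and column names here are made up for the example, and
it uses the two feed lines from above):

```python
import sqlite3

# An in-memory database for the sketch; pass a filename instead to persist it.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("""CREATE TABLE classes
               (class_name TEXT, room TEXT, student TEXT, gpa REAL)""")

# The two lines from the example feed.
rows = [
    ('Economics', 'Room 216', 'James Faker', 3.4),
    ('Social Studies', 'Room 231', 'Brain Fictitious', 3.5),
]
cur.executemany("INSERT INTO classes VALUES (?,?,?,?)", rows)
con.commit()

# Cross-referencing becomes a query instead of hand-rolled loops.
cur.execute("SELECT student, MAX(gpa) FROM classes")
top_row = cur.fetchone()
print(top_row)          # ('Brain Fictitious', 3.5)
con.close()
```

No server to install, and the file it writes is a regular database you can
keep adding feed snapshots to.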
> Maybe I should study classes.
I don't know, but I've always gotten along without them.
> I'm just not sure. What would be the most effective technique?
Don't know that either as I've only done it once, as follows:
## I was looking in my database of movie grosses I regularly copy
## from the Internet Movie Database and noticed I was _only_ 120
## weeks behind in my updates.
##
## Ouch.
##
## Copying a web page, pasting into a text file, running a perl
## script to convert it into a csv file and manually importing it
## into Access isn't so bad when you only have a couple to do at
## a time. Still, it's a labor intensive process and 120 isn't
## anything to look forward to.
##
## But I abandoned perl years ago when I took up Python, so I
## can use Python to completely automate the process now.
##
## Just have to figure out how.
##
## There's 3 main tasks: capture the web page, parse the web page
## to extract the data and insert the data into the database.
##
## But I only know how to do the last step, using the odbc tools
## from win32,
####import dbi
####import odbc
import re
## so I snoop around comp.lang.python to pick up some
## hints and keywords on how to do the other two tasks.
##
## Documentation on urllib2 was a bit vague, but got the web page
## after only a couple mis-steps.
import urllib2
## Unfortunately, HTMLParser remained beyond my grasp (is it
## my imagination or is the quality of the examples in the
## documentation inversely proportional to the subject
## difficulty?)
##
## Luckily, my bag of hints had a reference to Beautiful Soup,
## whose web site proclaims:
## Beautiful Soup is a Python HTML/XML parser
## designed for quick turnaround projects like
## screen-scraping.
## Looks like just what I need, maybe I can figure it out after all.
from BeautifulSoup import BeautifulSoup
target_dates = [['4','6','2008','April']]
####con = odbc.odbc("IMDB") # connect to MS-Access database
####cursor = con.cursor()
for d in target_dates:
    #
    # build url (with CGI parameters) from list of dates needing updating
    #
    the_year = d[2]
    the_date = '/'.join([d[0],d[1],d[2]])
    print '%10s scraping IMDB:' % (the_date),
    the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',d[1],'&month=',d[3]])
    req = urllib2.Request(url=the_url)
    f = urllib2.urlopen(req)
    www = f.read()
    #
    # ok, page captured. now make a BeautifulSoup object from it
    #
    soup = BeautifulSoup(www)
    #
    # that was easy, much more so than HTMLParser
    #
    # now, _all_ I have to do is figure out how to parse it
    #
    # ouch again. this is a lot harder than it looks in the
    # documentation. I need to get the data from cells of a
    # table nested inside another table and that's hard to
    # extrapolate from the examples showing how to find all
    # the comments on a web page.
    #
    # but this looks promising. if I grab all the table rows
    # (tr tags), each complete nested table is inside a cell
    # of the outer table (whose table tags are lost, but aren't
    # needed and whose absence makes extracting the nested
    # tables easier (when you do it the stupid way, but hey,
    # it works, so I'm sticking with it))
    #
    tr = soup.tr                    # table rows
    tr.extract()
    #
    # now, I only want the third nested table. how do I get it?
    # can't seem to get past the first one, should I be using
    # NextSibling or something? <scratches head...>
    #
    # but wait...I don't need the first two tables, so I can
    # simply extract and discard them. and since .extract()
    # CUTS the tables, after two extractions the table I want
    # IS the first one.
    #
    the_table = tr.find('table')    # discard
    the_table.extract()
    the_table = tr.find('table')    # discard
    the_table.extract()
    the_table = tr.find('table')    # weekly gross
    the_table.extract()
    #
    # of course, the data doesn't start in the first row,
    # there's formatting, header rows, etc. looks like it starts
    # in tr number [3]
    #
    ## >>> the_table.contents[3].td
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since tags always imply the first one, the above
    # is equivalent to
    #
    ## >>> the_table.contents[3].contents[0]
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since the title is the first of three cells, the
    # reporting year is
    #
    ## >>> the_table.contents[3].contents[1]
    ## <td> <a href="/Sections/Years/2001">2001</a> </td>
    #
    # finally, the 3rd cell must contain the gross
    #
    ## >>> the_table.contents[3].contents[2]
    ## <td align="RIGHT"> 259,674,120</td>
    #
    # but the contents of the first two cells are anchor tags.
    # to get the actual title string, I need the contents of the
    # contents. but that's not exactly what I want either,
    # I don't want a list, I need a string. and the string isn't
    # always in the same place in the list
    #
    # summarizing, what I need is
    #
    ## print the_table.contents[3].contents[0].contents[0].contents,
    ## print the_table.contents[3].contents[1].contents[1].contents,
    ## print the_table.contents[3].contents[2].contents
    #
    # and that almost works, just a couple more tweaks and I can
    # shove it into the database
    parsed = []
    for rec in the_table.contents[3:]:
        the_rec_type = type(rec)    # some rec are NavigableStrings, skip
        if str(the_rec_type) == "<type 'instance'>":
            #
            # ok, got a real data row
            #
            TITLE_DATE = rec.contents[0].contents[0].contents # a list inside a tuple
            #
            # and that means we still have to index the contents
            # of the contents of the contents of the contents by
            # adding [0][0] to TITLE_DATE
            #
            YEAR = rec.contents[1].contents[1].contents       # ditto
            #
            # this won't go into the database, just used as a filter to grab
            # the records associated with the posting date and discard
            # the others (which should already be in the database)
            #
            GROSS = rec.contents[2].contents                  # just a list
            #
            # one other minor glitch, that film date is part of the title
            # (which is of no use in the database), so it has to be pulled out
            # and put in a separate field
            #
            # temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE[0][0]))
            temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE))
            #
            # which works 99% of the time. unfortunately, the IMDB
            # consistency is somewhat dubious. the date is _supposed_
            # to be at the end of the string, but sometimes it's not.
            # so, usually, there are only 5 groups, but you have to
            # allow for the fact that there may be 6
            #
            try:
                the_title = temp_title.group(1) + temp_title.group(5)
            except:
                the_title = temp_title.group(1)
            the_gross = str(GROSS[0])
            #
            # and for some unexplained reason, dates will occasionally
            # be 2001/I instead of 2001, so we want to discard the trailing
            # crap, if any
            #
            the_film_year = temp_title.group(3)[:4]
            # if str(YEAR[0][0])==the_year:
            if str(YEAR[0])==the_year:
                parsed.append([the_date,the_title,the_film_year,the_gross])
    print '%3d records found ' % (len(parsed))
    #
    # wow, now just have to insert all the update records directly
    # into the database...into a temporary table, of course. as I said,
    # IMDB consistency is somewhat dubious (such as changing the spelling
    # of the titles), so a QC check will be required inside Access
    #
####    if len(parsed)>0:
####        print '...inserting into database'
####        for p in parsed:
####            cursor.execute("""
####INSERT INTO imdweeks2 ( Date_reported, Title, Film_Date, Gross_to_Date )
####SELECT ?,?,?,?;""",p)
####    else:
####        print '...aborting, no records found'
####
####cursor.close()
####con.close()

for p in parsed: print p
# and just because it works, doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)
##You should get this:
##
## 4/6/2008 scraping IMDB: 111 records found
##['4/6/2008', "[u'I Am Legend']", '2007', ' 256,386,216']
##['4/6/2008', "[u'National Treasure: Book of Secrets']", '2007', ' 218,701,477']
##['4/6/2008', "[u'Alvin and the Chipmunks']", '2007', ' 216,873,487']
##['4/6/2008', "[u'Juno']", '2007', ' 142,545,706']
##['4/6/2008', "[u'Horton Hears a Who!']", '2008', ' 131,076,768']
##['4/6/2008', "[u'Bucket List, The']", '2007', ' 91,742,612']
##['4/6/2008', "[u'10,000 BC']", '2008', ' 89,349,915']
##['4/6/2008', "[u'Cloverfield']", '2008', ' 80,034,302']
##['4/6/2008', "[u'Jumper']", '2008', ' 78,762,148']
##['4/6/2008', "[u'27 Dresses']", '2008', ' 76,376,607']
##['4/6/2008', "[u'No Country for Old Men']", '2007', ' 74,273,505']
##['4/6/2008', "[u'Vantage Point']", '2008', ' 71,037,105']
##['4/6/2008', "[u'Spiderwick Chronicles, The']", '2008', ' 69,872,230']
##['4/6/2008', '[u"Fool\'s Gold"]', '2008', ' 68,636,484']
##['4/6/2008', "[u'Hannah Montana/Miley Cyrus: Best of Both Worlds Concert Tour']", '2008', ' 65,010,561']
##['4/6/2008', "[u'Step Up 2: The Streets']", '2008', ' 57,389,556']
##['4/6/2008', "[u'Atonement']", '2007', ' 50,921,738']
##['4/6/2008', "[u'21']", '2008', ' 46,770,173']
##['4/6/2008', "[u'College Road Trip']", '2008', ' 40,918,686']
##['4/6/2008', "[u'There Will Be Blood']", '2007', ' 40,133,435']
##['4/6/2008', "[u'Meet the Spartans']", '2008', ' 38,185,300']
##['4/6/2008', "[u'Meet the Browns']", '2008', ' 37,662,502']
##['4/6/2008', "[u'Deep Sea 3D']", '2006', ' 36,141,373']
##['4/6/2008', "[u'Semi-Pro']", '2008', ' 33,289,722']
##['4/6/2008', "[u'Definitely, Maybe']", '2008', ' 31,973,840']
##['4/6/2008', "[u'Eye, The']", '2008', ' 31,397,498']
##['4/6/2008', "[u'Great Debaters, The']", '2007', ' 30,219,326']
##['4/6/2008', "[u'Bank Job, The']", '2008', ' 26,804,821']
##['4/6/2008', "[u'Other Boleyn Girl, The']", '2008', ' 26,051,195']
##['4/6/2008', "[u'Drillbit Taylor']", '2008', ' 25,490,483']
##['4/6/2008', "[u'Magnificent Desolation: Walking on the Moon 3D']", '2005', ' 23,283,158']
##['4/6/2008', "[u'Shutter']", '2008', ' 23,138,277']
##['4/6/2008', "[u'Never Back Down']", '2008', ' 23,080,675']
##['4/6/2008', "[u'Mad Money']", '2008', ' 20,648,442']
##['4/6/2008', "[u'Galapagos']", '1955', ' 17,152,405']
##['4/6/2008', "[u'Superhero Movie']", '2008', ' 16,899,661']
##['4/6/2008', "[u'Wild Safari 3D']", '2005', ' 16,550,933']
##['4/6/2008', "[u'Kite Runner, The']", '2007', ' 15,790,223']
##['4/6/2008', '[u"Nim\'s Island"]', '2008', ' 13,210,579']
##['4/6/2008', "[u'Leatherheads']", '2008', ' 12,682,595']
##['4/6/2008', "[u'Be Kind Rewind']", '2008', ' 11,028,439']
##['4/6/2008', "[u'Doomsday']", '2008', ' 10,955,425']
##['4/6/2008', "[u'Sea Monsters: A Prehistoric Adventure']", '2007', ' 10,745,308']
##['4/6/2008', "[u'Miss Pettigrew Lives for a Day']", '2008', ' 10,534,800']
##['4/6/2008', "[u'Môme, La']", '2007', ' 10,299,782']
##['4/6/2008', "[u'Penelope']", '2006', ' 9,646,154']
##['4/6/2008', "[u'Misma luna, La']", '2007', ' 8,959,462']
##['4/6/2008', "[u'Roving Mars']", '2006', ' 8,463,161']
##['4/6/2008', "[u'Stop-Loss']", '2008', ' 8,170,755']
##['4/6/2008', "[u'Ruins, The']", '2008', ' 8,003,241']
##['4/6/2008', "[u'Bella']", '2006', ' 7,776,080']
##['4/6/2008', "[u'U2 3D']", '2007', ' 7,348,105']
##['4/6/2008', "[u'Orfanato, El']", '2007', ' 7,159,147']
##['4/6/2008', "[u'In Bruges']", '2008', ' 6,831,761']
##['4/6/2008', "[u'Savages, The']", '2007', ' 6,571,599']
##['4/6/2008', "[u'Scaphandre et le papillon, Le']", '2007', ' 5,990,075']
##['4/6/2008', "[u'Run Fatboy Run']", '2007', ' 4,430,583']
##['4/6/2008', "[u'Persepolis']", '2007', ' 4,200,980']
##['4/6/2008', "[u'Charlie Bartlett']", '2007', ' 3,928,412']
##['4/6/2008', "[u'Jodhaa Akbar']", '2008', ' 3,434,629']
##['4/6/2008', "[u'Fälscher, Die']", '2007', ' 2,903,370']
##['4/6/2008', "[u'Bikur Ha-Tizmoret']", '2007', ' 2,459,543']
##['4/6/2008', "[u'Shine a Light']", '2008', ' 1,488,081']
##['4/6/2008', "[u'Race']", '2008', ' 1,327,606']
##['4/6/2008', "[u'Funny Games U.S.']", '2007', ' 1,274,055']
##['4/6/2008', "[u'4 luni, 3 saptamâni si 2 zile']", '2007', ' 1,103,315']
##['4/6/2008', "[u'Married Life']", '2007', ' 1,002,318']
##['4/6/2008', "[u'Diary of the Dead']", '2007', ' 893,192']
##['4/6/2008', "[u'Starting Out in the Evening']", '2007', ' 882,518']
##['4/6/2008', "[u'Dolphins and Whales 3D: Tribes of the Ocean']", '2008', ' 854,304']
##['4/6/2008', "[u'Sukkar banat']", '2007', ' 781,954']
##['4/6/2008', "[u'Bonneville']", '2006', ' 471,679']
##['4/6/2008', "[u'Flawless']", '2007', ' 390,892']
##['4/6/2008', "[u'Paranoid Park']", '2007', ' 387,119']
##['4/6/2008', "[u'Teeth']", '2007', ' 321,732']
##['4/6/2008', "[u'Hammer, The']", '2007', ' 321,579']
##['4/6/2008', "[u'Priceless']", '2008', ' 320,131']
##['4/6/2008', "[u'Steep']", '2007', ' 259,840']
##['4/6/2008', "[u'Honeydripper']", '2007', ' 259,192']
##['4/6/2008', "[u'Snow Angels']", '2007', ' 255,147']
##['4/6/2008', "[u'Taxi to the Dark Side']", '2007', ' 231,743']
##['4/6/2008', "[u'Cheung Gong 7 hou']", '2008', ' 188,067']
##['4/6/2008', "[u'Ne touchez pas la hache']", '2007', ' 184,513']
##['4/6/2008', "[u'Sleepwalking']", '2008', ' 160,715']
##['4/6/2008', "[u'Chicago 10']", '2007', ' 149,456']
##['4/6/2008', "[u'Girls Rock!']", '2007', ' 92,636']
##['4/6/2008', "[u'Beaufort']", '2007', ' 87,339']
##['4/6/2008', "[u'Shelter']", '2007', ' 85,928']
##['4/6/2008', "[u'My Blueberry Nights']", '2007', ' 74,146']
##['4/6/2008', "[u'Témoins, Les']", '2007', ' 71,624']
##['4/6/2008', "[u'Mépris, Le']", '1963', ' 70,761']
##['4/6/2008', "[u'Singing Revolution, The']", '2006', ' 66,482']
##['4/6/2008', "[u'Chop Shop']", '2007', ' 58,858']
##['4/6/2008', '[u"Chansons d\'amour, Les"]', '2007', ' 58,577']
##['4/6/2008', "[u'Praying with Lior']", '2007', ' 57,325']
##['4/6/2008', "[u'Yihe yuan']", '2006', ' 57,155']
##['4/6/2008', "[u'Casa de Alice, A']", '2007', ' 53,700']
##['4/6/2008', "[u'Blindsight']", '2006', ' 51,256']
##['4/6/2008', "[u'Boarding Gate']", '2007', ' 37,107']
##['4/6/2008', "[u'Voyage du ballon rouge, Le']", '2007', ' 35,222']
##['4/6/2008', "[u'Bill']", '2007', ' 35,201']
##['4/6/2008', "[u'Mio fratello è figlio unico']", '2007', ' 34,138']
##['4/6/2008', "[u'Chapter 27']", '2007', ' 32,602']
##['4/6/2008', "[u'Meduzot']", '2007', ' 25,352']
##['4/6/2008', "[u'Shotgun Stories']", '2007', ' 25,346']
##['4/6/2008', "[u'Sconosciuta, La']", '2006', ' 18,569']
##['4/6/2008', "[u'Imaginary Witness: Hollywood and the Holocaust']", '2004', ' 18,475']
##['4/6/2008', "[u'Irina Palm']", '2007', ' 14,214']
##['4/6/2008', "[u'Naissance des pieuvres']", '2007', ' 7,418']
##['4/6/2008', "[u'Four Letter Word, A']", '2007', ' 6,017']
##['4/6/2008', "[u'Tuya de hun shi']", '2006', ' 2,619']