how would you...?

Sanoski <Joshuajruss at gmail.com>
Sat May 17 05:02:43 EDT 2008


The reason I ask about text files is that I need to save the data locally
and store it in a way that makes backups easy. Then if your computer crashes
and you lose everything, but you still have the program's data files backed
up, you can just download the program again, extract the backed-up data into
a specific directory, and it works exactly the way it did before. I suppose
a SQLite database might solve this, but I'm not sure. I'm just getting
started and don't know much about it yet.
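
From what I've read, SQLite keeps everything in a single file, so backing up
would just mean copying that one file somewhere safe. Something like this is
what I have in mind (just a rough sketch; the table layout and file name are
made up by me):

import sqlite3

# everything lives in this one file; backing up the program's data
# is just a matter of copying it somewhere safe
conn = sqlite3.connect('school_feed.db')
cur = conn.cursor()

# made-up layout -- one row per feed entry
cur.execute("""CREATE TABLE IF NOT EXISTS classes
               (name TEXT, room TEXT, student TEXT, gpa REAL, pic TEXT)""")

cur.execute("INSERT INTO classes VALUES (?, ?, ?, ?, ?)",
            ('Economics', '312', 'John Carbroil', 4.0, 'economics_gpa.png'))
conn.commit()

# later, after restoring school_feed.db on a new machine, the same
# queries work exactly as before
for row in cur.execute("SELECT name, student, gpa FROM classes ORDER BY gpa DESC"):
    print row

conn.close()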

I'm also still not sure how to download and associate the picture that each
entry has. The main thing for me right now is getting started. The program
needs to get information from the web; in this case, it's a simple XML feed.
The one thing that should make it easier is that every post to the feed is
very consistent. Each header starts with the letter A, which stands for
Alpike Tech, followed by the name of the class, the room number, the leading
student, and his GPA. All of that is one line of text, but it's also a link
to more information. For example:

A Economics, 312, John Carbroil, 4.0

That's one whole post to the feed. Like I say, it's very simple and
consistent, which should make this easier.
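
My rough plan for the fetching and parsing part looks something like this
(just a sketch -- the feed URL is made up, and I'm assuming the feed is
RSS-style with <item>/<title>/<link> elements and that every title really
follows the "A Class, Room, Student, GPA" pattern):

import urllib2
import xml.etree.ElementTree as ET

FEED_URL = 'http://example.edu/alpike_feed.xml'   # made-up URL

xml_data = urllib2.urlopen(FEED_URL).read()
root = ET.fromstring(xml_data)

entries = []
for item in root.findall('.//item'):
    title = item.findtext('title')   # e.g. "A Economics, 312, John Carbroil, 4.0"
    link = item.findtext('link')
    # drop the leading "A " (the Alpike Tech marker), then split on commas
    fields = [f.strip() for f in title[2:].split(',')]
    class_name, room, student, gpa = fields[0], fields[1], fields[2], float(fields[3])
    entries.append((class_name, room, student, gpa, link))

print '%d entries found' % len(entries)

Counting instances would then just be len(entries), and the same tuples could
go straight into a SQLite table like the one sketched above.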

Eventually I want it to follow that link and grab information from
there too, but I'll worry about that later. Technically, if I figure
this first part out, that problem should take care of itself.
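
When I get to that part, I'm guessing it will look something like the sketch
below (the idea of grabbing the first <img> on the linked page, and the local
file naming, are just assumptions on my part):

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def grab_gpa_picture(link, class_name):
    # follow a feed entry's link and save its GPA graph image locally,
    # so it gets backed up along with the rest of the data
    page = urllib2.urlopen(link).read()
    soup = BeautifulSoup(page)
    img = soup.find('img')               # assuming the graph is the first image
    if img is not None and img.get('src'):
        local_name = '%s_gpa.png' % class_name.replace(' ', '_').lower()
        # assumes the src is an absolute URL
        urllib.urlretrieve(img['src'], local_name)
        return local_name
    return None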





On May 17, 1:08 am, Mensanator <mensana... at aol.com> wrote:
> On May 16, 11:43 pm, Sanoski <Joshuajr... at gmail.com> wrote:
>
>
>
> > I'm pretty new to programming. I've just been studying a few weeks off
> > and on. I know a little, and I'm learning as I go. Programming is so
> > much fun! I really wish I had gotten into it years ago, but here's my
> > question. I have a long-term project in mind, and I want to know if it's
> > feasible and how difficult it will be.
>
> > There's an XML feed for my school that some other class designed. It's
> > just a simple idea that lists each class, its room number, and the person
> > with the highest GPA. The feed is set up like this. Each one of the
> > following lines would also be a link to more information about the
> > class, etc.
>
> > Economics, Room 216, James Faker, 3.4
> > Social Studies, Room 231, Brain Fictitious, 3.5
>
> > etc, etc
>
> > The student also has a picture reference that depicts his GPA based on
> > the number. The picture is basically just a graph. I just want to
> > write a program that uses the information on this feed.
>
> > I want it to reach out to this XML feed, record each instance of the
> > above format along with the picture reference of the highest GPA
> > student, download it locally, and then be able to use that information
> > in various ways. I figured I'll start by counting each instance. For
> > example, the above would be 2 instances.
>
> > Eventually, I want it to be able to cross reference data you've
> > already downloaded, and be able to compare GPA's, etc. It would have a
> > GUI and everything too, but I am trying to keep it simple right now,
> > and just build onto it as I learn.
>
> > So let's just say this. How do you grab information from the web,
>
> Depends on the web page.
>
> > in this case a feed,
>
> Haven't tried that, just a simple CGI.
>
> > and then use that in calculations?
>
> The key is some type of structure, be it database records,
> a list of lists, or whatever. Something that you can iterate
> through, sort, find max element, etc.
>
> > How would you
> > implement such a project?
>
> The example below uses BeautifulSoup. I'm posting it not
> because it matches your problem, but to give you an idea of
> the techniques involved.
>
> > Would you save the information into a text file?
>
> Possibly, but generally no. Text files aren't very useful
> except as a data exchange medium.
>
> > Or would you use something else?
>
> Your application lends itself to a database approach.
> Note in my example the database part of the code is disabled.
> Not everyone has MS-Access on Windows.
>
> > Should I study up on SQLite?
>
> Yes. The MS-Access code I have can be easily changed to SQLite.
>
> > Maybe I should study classes.
>
> I don't know, but I've always gotten along without them.
>
> > I'm just not sure. What would be the most effective technique?
>
> Don't know that either as I've only done it once, as follows:
>
> ##  I was looking in my database of movie grosses I regularly copy
> ##  from the Internet Movie Database and noticed I was _only_ 120
> ##  weeks behind in my updates.
> ##
> ##  Ouch.
> ##
> ##  Copying a web page, pasting into a text file, running a perl
> ##  script to convert it into a csv file and manually importing it
> ##  into Access isn't so bad when you only have a couple to do at
> ##  a time. Still, it's a labor intensive process and 120 isn't
> ##  anything to look forward to.
> ##
> ##  But I abandoned perl years ago when I took up Python, so I
> ##  can use Python to completely automate the process now.
> ##
> ##  Just have to figure out how.
> ##
> ##  There's 3 main tasks: capture the web page, parse the web page
> ##  to extract the data and insert the data into the database.
> ##
> ##  But I only know how to do the last step, using the odbc tools
> ##  from win32,
>
> ####import dbi
> ####import odbc
> import re
>
> ##  so I snoop around comp.lang.python to pick up some
> ##  hints and keywords on how to do the other two tasks.
> ##
> ##  Documentation on urllib2 was a bit vague, but got the web page
> ##  after only a couple mis-steps.
>
> import urllib2
>
> ##  Unfortunately, HTMLParser remained beyond my grasp (is it
> ##  my imagination or is the quality of the examples in the
> ##  documentation inversely proportional to the subject
> ##  difficulty?)
> ##
> ##  Luckily, my bag of hints had a reference to Beautiful Soup,
> ##  whose web site proclaims:
> ##      Beautiful Soup is a Python HTML/XML parser
> ##      designed for quick turnaround projects like
> ##      screen-scraping.
> ##  Looks like just what I need, maybe I can figure it out after all.
>
> from BeautifulSoup import BeautifulSoup
>
> target_dates = [['4','6','2008','April']]
>
> ####con = odbc.odbc("IMDB")  # connect to MS-Access database
> ####cursor = con.cursor()
>
> for d in target_dates:
>   #
>   # build url (with CGI parameters) from list of dates needing updating
>   #
>   the_year = d[2]
>   the_date = '/'.join([d[0],d[1],d[2]])
>   print '%10s scraping IMDB:'  % (the_date),
>   the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',d[1],'&month=',d[3]])
>   req = urllib2.Request(url=the_url)
>   f = urllib2.urlopen(req)
>   www = f.read()
>   #
>   # ok, page captured. now make a BeautifulSoup object from it
>   #
>   soup = BeautifulSoup(www)
>   #
>   # that was easy, much more so than HTMLParser
>   #
>   # now, _all_ I have to do is figure out how to parse it
>   #
>   # ouch again. this is a lot harder than it looks in the
>   # documentation. I need to get the data from cells of a
>   # table nested inside another table and that's hard to
>   # extrapolate from the examples showing how to find all
>   # the comments on a web page.
>   #
>   # but this looks promising. if I grab all the table rows
>   # (tr tags), each complete nested table is inside a cell
>   # of the outer table (whose table tags are lost, but aren't
>   # needed and whose absence makes extracting the nested
>   # tables easier (when you do it the stupid way, but hey,
>   # it works, so I'm sticking with it))
>   #
>   tr = soup.tr                          # table rows
>   tr.extract()
>   #
>   # now, I only want the third nested table. how do I get it?
>   # can't seem to get past the first one, should I be using
>   # NextSibling or something? <scratches head...>
>   #
>   # but wait...I don't need the first two tables, so I can
>   # simply extract and discard them. and since .extract()
>   # CUTS the tables, after two extractions the table I want
>   # IS the first one.
>   #
>   the_table = tr.find('table')          # discard
>   the_table.extract()
>   the_table = tr.find('table')          # discard
>   the_table.extract()
>   the_table = tr.find('table')          # weekly gross
>   the_table.extract()
>   #
>   # of course, the data doesn't start in the first row,
>   # there's formatting, header rows, etc. looks like it starts
>   # in tr number [3]
>   #
>   ##  >>> the_table.contents[3].td
>   ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
>   #
>   # and since tags always imply the first one, the above
>   # is equivalent to
>   #
>   ##  >>> the_table.contents[3].contents[0]
>   ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
>   #
>   # and since the title is the first of three cells, the
>   # reporting year is
>   #
>   ##  >>> the_table.contents[3].contents[1]
>   ##  <td> <a href="/Sections/Years/2001">2001</a> </td>
>   #
>   # finally, the 3rd cell must contain the gross
>   #
>   ##  >>> the_table.contents[3].contents[2]
>   ##  <td align="RIGHT"> 259,674,120</td>
>   #
>   # but the contents of the first two cells are anchor tags.
>   # to get the actual title string, I need the contents of the
>   # contents. but that's not exactly what I want either,
>   # I don't want a list, I need a string. and the string isn't
>   # always in the same place in the list
>   #
>   # summarizing, what I need is
>   #
>   ##  print the_table.contents[3].contents[0].contents[0].contents,
>   ##  print the_table.contents[3].contents[1].contents[1].contents,
>   ##  print the_table.contents[3].contents[2].contents
>   #
>   # and that almost works, just a couple more tweaks and I can
>   # shove it into the database
>
>   parsed = []
>
>   for rec in the_table.contents[3:]:
>     the_rec_type = type(rec)                      # some recs are NavigableStrings, skip
>     if str(the_rec_type) == "<type 'instance'>":
>       #
>       # ok, got a real data row
>       #
>       TITLE_DATE = rec.contents[0].contents[0].contents   # a list inside a tuple
>       #
>       # and that means we still have to index the contents
>       # of the contents of the contents of the contents by
>       # adding [0][0] to TITLE_DATE
>       #
>       YEAR =  rec.contents[1].contents[1].contents        # ditto
>       #
>       # this won't go into the database, just used as a filter to grab
>       # the records associated with the posting date and discard
>       # the others (which should already be in the database)
>       #
>       GROSS = rec.contents[2].contents                    # just a list
>       #
>       # one other minor glitch, that film date is part of the title
>       # (which is of no use in the database), so it has to be pulled out
>       # and put in a separate field
>       #
> #      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE[0][0]))
>       temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE))
>       #
>       # which works 99% of the time. unfortunately, the IMDB
>       # consistency is somewhat dubious. the date is _supposed_
>       # to be at the end of the string, but sometimes it's not.
>       # so, usually, there are only 5 groups, but you have to
>       # allow for the fact that there may be 6
>       #
>       try:
>         the_title = temp_title.group(1) + temp_title.group(5)
>       except:
>         the_title = temp_title.group(1)
>       the_gross = str(GROSS[0])
>       #
>       # and for some unexplained reason, dates will occasionally
>       # be 2001/I instead of 2001, so we want to discard the trailing
>       # crap, if any
>       #
>       the_film_year = temp_title.group(3)[:4]
> #      if str(YEAR[0][0])==the_year:
>       if str(YEAR[0])==the_year:
>         parsed.append([the_date,the_title,the_film_year,the_gross])
>
>   print '%3d records found ' % (len(parsed))
>   #
>   # wow, now just have to insert all the update...
>



