how would you...?

Mensanator mensanator at aol.com
Sat May 17 20:50:56 EDT 2008


On May 17, 4:02 am, Sanoski <Joshuajr... at gmail.com> wrote:
> The reason I ask about text files is the need to save the data
> locally, and have it stored in a way where backups can easily
> be made.

Sure, you can always do that if you want. But if your target
is SQLite or MS-Access, those are files also, so they can be
backed up just as easily as text files.
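
For instance, backing up a SQLite database is just copying one
file. A minimal sketch (the file names here are made up for
illustration):

import sqlite3
import shutil

# hypothetical paths -- point these at your own database
# and backup location
db_path = 'school_feed.db'
backup_path = 'school_feed_backup.db'

# flush any pending writes before copying the file
con = sqlite3.connect(db_path)
con.commit()
con.close()

shutil.copy(db_path, backup_path)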

>
> Then if your computer crashes and you lose everything, but
> you have the data files it uses backed up, you can just
> download the program, extract the backed up data to a
> specific directory, and then it works exactly the way it
> did before you lost it. I suppose a SQLite database might
> solve this, but I'm not sure.

It will. Remember, once in a database, you have value-added
features like filtering, sorting, etc. that you would have
to do yourself if you simply read in text files.
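
Here's a rough sketch with Python's built-in sqlite3 module of
the kind of thing I mean (the table layout and file name are my
own invention, not from your feed):

import sqlite3

con = sqlite3.connect('school_feed.db')
cur = con.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS posts
               (school TEXT, class TEXT, room TEXT,
                student TEXT, gpa REAL)""")
cur.execute("INSERT INTO posts VALUES (?,?,?,?,?)",
            ('A', 'Economics', '312', 'John Carbroil', 4.0))
con.commit()

# filtering and sorting in one line of SQL instead of
# hand-written loops over text files
cur.execute("""SELECT student, gpa FROM posts
               WHERE gpa >= 3.5 ORDER BY gpa DESC""")
for row in cur.fetchall():
  print row

con.close()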

> I'm just getting started, and I
> don't know too much about it yet.

Trust me, a database is the way to go.
My preference is MS-Access, because I need it for work.
It is a great tool for learning databases because its
visual interface can make you productive BEFORE you learn
SQL.

>
> I'm also still not sure how to download and associate the pictures
> that each entry has for it.

See example at end of post.

> The main thing for me now is getting
> started. It needs to get information from the web. In this case,
> it's a simple XML feed.

BeautifulSoup also has an XML parser. Go to their
web page and read the documentation.
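
For example, a quick sketch using BeautifulStoneSoup (the XML
class that ships with BeautifulSoup). The feed URL and tag names
below are stand-ins, since I haven't seen your actual feed:

from BeautifulSoup import BeautifulStoneSoup
import urllib2

# stand-in URL -- use your real feed address
xml = urllib2.urlopen("http://www.example.com/feed.xml").read()
soup = BeautifulStoneSoup(xml)

# print the title text of every <item> -- the tag names
# depend on what your feed actually uses
for item in soup.findAll('item'):
  print item.title.string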

> The one thing that seems that would
> make it easier is every post to the feed is very consistent.
> Each header starts with the letter A, which stands for Alpike
> Tech, follow by the name of the class, the room number, the
> leading student, and his GPA. All that is one line of text.
> But it's also a link to more information. For example:
>
> A Economics, 312, John Carbroil, 4.0
> That's one whole post to the feed. Like I say, it's very
> simple and consistent. Which should make this easier.

That's what you want for parsing: how to separate
a composite set of data into fields. Simple cases can
often be done with split(), complex ones with regular
expressions. A sketch of both follows below.
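
Using your sample line (the field names are just my guesses from
your description):

import re

line = "A Economics, 312, John Carbroil, 4.0"

# simple: split on commas, then peel the school code off the front
fields = [f.strip() for f in line.split(',')]
school, course = fields[0].split(' ', 1)
room, student, gpa = fields[1], fields[2], float(fields[3])
print school, course, room, student, gpa

# complex: a regular expression that names each piece
m = re.match(r'(?P<school>\w+) (?P<course>[^,]+), (?P<room>\d+), '
             r'(?P<student>[^,]+), (?P<gpa>[\d.]+)', line)
if m:
  print m.group('student'), m.group('gpa')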

>
> Eventually I want it to follow that link and grab information
> from there too, but I'll worry about that later. Technically,
> if I figure this first part out, that problem should take
> care of itself.
>


A sample picture scraper:

from BeautifulSoup import BeautifulSoup
import urllib2
import urllib

#
# start by scraping the web page
#
the_url="http://members.aol.com/mensanator/OHIO/TheCobs.htm"
req = urllib2.Request(url=the_url)
f = urllib2.urlopen(req)
www = f.read()
soup = BeautifulSoup(www)
print soup.prettify()

#
# a simple page with pictures
#
##<html>
## <head>
##  <title>
##   Ohio - The Cobs!
##  </title>
## </head>
## <body>
##  <h1>
##   Ohio Vacation Pictures - The Cobs!
##  </h1>
##  <hr />
##  <img src="AUT_2784.JPG" />
##  <br />
##  WTF?
##  <p>
##   <img src="AUT_2764.JPG" />
##   <br />
##   This is surreal.
##  </p>
##  <p>
##   <img src="AUT_2765.JPG" />
##   <br />
##   Six foot tall corn cobs made of concrete.
##  </p>
##  <p>
##   <img src="AUT_2766.JPG" />
##   <br />
##   109 of them, laid out like a modern Stonehenge.
##  </p>
##  <p>
##   <img src="AUT_2769.JPG" />
##   <br />
##   With it's own Druid worshippers.
##  </p>
##  <p>
##   <img src="AUT_2781.JPG" />
##   <br />
##   Cue the
##   <i>
##    Also Sprach Zarathustra
##   </i>
##   soundtrack.
##  </p>
##  <p>
##   <img src="100_0887.JPG" />
##   <br />
##   Air & Space Museums are a dime a dozen.
##   <br />
##   But there's only
##   <b>
##    one
##   </b>
##   Cobs!
##  </p>
##  <p>
##  </p>
## </body>
##</html>

#
# parse the page to find all the pictures (image tags)
#
the_pics = soup.findAll('img')

for i in the_pics:
  print i

##<img src="AUT_2784.JPG" />
##<img src="AUT_2764.JPG" />
##<img src="AUT_2765.JPG" />
##<img src="AUT_2766.JPG" />
##<img src="AUT_2769.JPG" />
##<img src="AUT_2781.JPG" />
##<img src="100_0887.JPG" />

#
# the pictures have no path, so they must be in the
# same directory as the web page
#
the_jpg_path="http://members.aol.com/mensanator/OHIO/"

#
# now with urllib, copy the picture files to the local
# hard drive renaming with sequence id at the same time
#
for i,j in enumerate(the_pics):
  p = the_jpg_path + j['src']
  q = 'C:\\scrape\\' + 'pic' + str(i).zfill(4) + '.jpg'
  urllib.urlretrieve(p,q)

#
# and here's the captured files
#
##  C:\>dir scrape
##   Volume in drive C has no label.
##   Volume Serial Number is D019-C60D
##
##   Directory of C:\scrape
##
##  05/17/2008  07:06 PM    <DIR>          .
##  05/17/2008  07:06 PM    <DIR>          ..
##  05/17/2008  07:05 PM            69,877 pic0000.jpg
##  05/17/2008  07:05 PM            71,776 pic0001.jpg
##  05/17/2008  07:05 PM            70,958 pic0002.jpg
##  05/17/2008  07:05 PM            69,261 pic0003.jpg
##  05/17/2008  07:05 PM            70,653 pic0004.jpg
##  05/17/2008  07:05 PM            70,564 pic0005.jpg
##  05/17/2008  07:05 PM           113,356 pic0006.jpg
##                 7 File(s)        536,445 bytes
##                 2 Dir(s)  27,823,570,944 bytes free


