Please help with Python Script/MS-Access (DAO)
Kirby Urner
urnerk at qwest.net
Mon Jun 17 20:09:57 EDT 2002
The script below uses the sgml parser, which is tag
aware, so the tags themselves are dropped and only the
text between them ends up in the output. I call it with
a web page name as a parameter, which gets appended to
the hard-coded base url, e.g.
$ ./testurl.py Plant1.html
and data.txt looks like this (... = skipped lines):
OrderID,OrderDate,PlantID,ProductID,OrderQty
19034,4/1/02,1,7,1732
19035,4/1/02,1,9,1888
19036,4/1/02,1,4,1048
19037,4/1/02,1,5,1708
19038,4/1/02,1,6,876
...
19411,4/27/02,1,10,1288
19412,4/27/02,1,8,1732
19413,4/27/02,1,1,732
19414,4/27/02,1,2,236
19415,4/27/02,1,9,1596
Now it's up to you in another function/module to read
this downloaded file and knock off the header. If you
don't trust the 4-lines pattern, you could trigger off
getting a digit as the first non-blank character of a
line, or whatever. At least you don't have the <p> and
other tags to mess with, thanks to the sgml parser
stripping 'em already.
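
A minimal sketch of that digit trigger, assuming each data row
starts with a numeric OrderID as in the sample above. The function
name data_rows is made up for illustration, and the sketch simply
drops every line that does not start with a digit:

```python
def data_rows(lines):
    """Keep only lines whose first non-blank character is a digit.

    `lines` is any iterable of text lines, e.g. an open data.txt.
    Each kept line is split on commas into a list of fields.
    """
    rows = []
    for line in lines:
        text = line.strip()
        if text and text[0].isdigit():   # header and blank lines fail this test
            rows.append(text.split(","))
    return rows


# Sample lines taken from the data.txt listing above
sample = [
    "OrderID,OrderDate,PlantID,ProductID,OrderQty\n",
    "19034,4/1/02,1,7,1732\n",
    "\n",
    "19035,4/1/02,1,9,1888\n",
]
rows = data_rows(sample)
```

With a real file you would pass the open file object instead of
the sample list, e.g. data_rows(open("data.txt")).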
Kirby
=======================
#!/usr/bin/python
# with thanks to Fredrik Lundh, Python Standard Library (O'Reilly)

import urllib,sys
import sgmllib

class FoundEnd(Exception):
    pass

class Extract(sgmllib.SGMLParser):

    def __init__(self,verbose=0):
        sgmllib.SGMLParser.__init__(self,verbose)
        self.data = []

    def handle_data(self,data):
        self.data.append(data)

    def start_body(self,attr):
        print "Body Start"

    def end_body(self):
        print "Body End"
        raise FoundEnd

def getwebdata(wp):
    p = Extract()
    n = 0
    try:
        while 1:
            s = wp.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundEnd:
        return p.data
    return None

if __name__ == '__main__':
    webpage = sys.argv[1]
    baseurl = "http://opim.wharton.upenn.edu/~opim101/spring02/"
    fp = urllib.urlopen(baseurl + webpage)
    output = open("data.txt","w")
    results = getwebdata(fp)
    fp.close()
    if results:
        for i in results:
            output.write(i)
    output.close()
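
The sgmllib module was removed in Python 3, so the script above
only runs under Python 2. A rough Python 3 sketch of the same idea,
swapping in html.parser.HTMLParser from the standard library, might
look like the block below. The extract_text helper and the done
flag are made up for illustration. Unlike the original, this
version sets a flag at </body> instead of raising an exception, and
it returns whatever text it collected even if no </body> is seen.

```python
from html.parser import HTMLParser


class Extract(HTMLParser):
    """Collect the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.data = []
        self.done = False

    def handle_data(self, data):
        if not self.done:
            self.data.append(data)

    def handle_endtag(self, tag):
        # HTMLParser passes tag names in lower case
        if tag == "body":
            self.done = True   # ignore anything after </body>


def extract_text(html):
    """Return the tag-free text of an HTML string."""
    parser = Extract()
    parser.feed(html)
    parser.close()
    return "".join(parser.data)


# Made-up page fragment in the shape of the data shown earlier
page = ("<html><body><p>19034,4/1/02,1,7,1732\n"
        "<p>19035,4/1/02,1,9,1888\n</body></html>trailing")
text = extract_text(page)
```

To fetch a live page you would read it with
urllib.request.urlopen(url).read() and decode the bytes before
calling extract_text. The right encoding depends on the page, so
that part is left out here.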