Please help with Python Script/MS-Access (DAO)

Kirby Urner urnerk at qwest.net
Mon Jun 17 20:09:57 EDT 2002


The script below uses the sgml parser, which is tag-aware,
so the HTML tags are left out of the output data.  I
call the script with a web page name as its parameter
(the base url is hard-coded), e.g.

$ ./testurl.py Plant1.html

and data.txt looks like this (... = skipped lines):


 OrderID,OrderDate,PlantID,ProductID,OrderQty
 19034,4/1/02,1,7,1732
 19035,4/1/02,1,9,1888
 19036,4/1/02,1,4,1048
 19037,4/1/02,1,5,1708
 19038,4/1/02,1,6,876
 ...
 19411,4/27/02,1,10,1288
 19412,4/27/02,1,8,1732
 19413,4/27/02,1,1,732
 19414,4/27/02,1,2,236
 19415,4/27/02,1,9,1596

Now it's up to you, in another function/module, to read
this downloaded file and knock off the header.  If you
don't trust the 4-lines pattern, you could trigger on
a digit being the first non-blank character of a line,
or some similar test.  At least you don't have the <p>
and other tags to mess with, thanks to the sgml parser
stripping 'em already.
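
The digit test described above could be sketched roughly as follows.
This is only an illustration: the function name read_orders and the
sample lines are made up here, and it assumes every data row starts
with a digit while header and leftover page text never do.

```python
def read_orders(lines):
    """Keep only lines whose first non-blank character is a digit.

    Returns a list of rows, each row a list of field strings.
    (Hypothetical helper, not part of the original script.)
    """
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines, the column-name header, and any other
        # page text that does not start with a digit.
        if not stripped or not stripped[0].isdigit():
            continue
        rows.append(stripped.split(","))
    return rows


# Sample input shaped like the data.txt excerpt shown earlier.
sample = [
    "\n",
    " OrderID,OrderDate,PlantID,ProductID,OrderQty\n",
    " 19034,4/1/02,1,7,1732\n",
    " 19035,4/1/02,1,9,1888\n",
]
rows = read_orders(sample)
```

With the real file you would pass the open file object instead of
the sample list, e.g. read_orders(open("data.txt")).  A header line
that happens to begin with a digit would slip through, so check the
first few rows before loading them anywhere.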

Kirby


=======================

#!/usr/bin/python

# with thanks to Fredrik Lundh, Python Standard Library (O'Reilly)

import urllib,sys
import sgmllib

class FoundEnd(Exception):
    pass

class Extract(sgmllib.SGMLParser):

    def __init__(self,verbose=0):
        sgmllib.SGMLParser.__init__(self,verbose)
        self.data = []

    def handle_data(self,data):
        self.data.append(data)

    def start_body(self,attr):
        print "Body Start"

    def end_body(self):
        print "Body End"
        raise FoundEnd

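def getwebdata(wp):
    # Feed the page to the parser in 512-byte chunks; end_body()
    # raises FoundEnd once </body> is seen, which ends the loop.
    p = Extract()
    try:
        while 1:
            s = wp.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundEnd:
        return p.data
    # No </body> tag was found, so treat the page as unusable.
    return None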

if __name__ == '__main__':
    webpage = sys.argv[1]
    baseurl = "http://opim.wharton.upenn.edu/~opim101/spring02/"
    fp = urllib.urlopen(baseurl + webpage)
    output = open("data.txt","w")
    results = getwebdata(fp)
    fp.close()

    if results:
        for i in results:
            output.write(i)
    output.close()
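
The sgmllib module was removed in Python 3, so the script above only
runs on Python 2.  For anyone on a newer interpreter, roughly the same
tag-stripping idea can be sketched with html.parser from the standard
library.  This is an adaptation, not the original method: the function
name extract_text and the sample page are made up for illustration, it
works on a string rather than a url, and unlike getwebdata() it
returns whatever text it collected even when no </body> tag shows up.

```python
from html.parser import HTMLParser


class FoundEnd(Exception):
    """Raised to stop parsing once </body> is reached."""


class Extract(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_data(self, data):
        # Called with the text between tags; the tags themselves
        # never reach this method.
        self.data.append(data)

    def handle_endtag(self, tag):
        # Tag names arrive lowercased.
        if tag == "body":
            raise FoundEnd


def extract_text(html):
    """Return the text of html up to </body>, with tags removed."""
    p = Extract()
    try:
        p.feed(html)
        p.close()
    except FoundEnd:
        pass
    return "".join(p.data)


# Made-up page fragment, just to show the effect.
page = ("<html><body><p>OrderID,OrderQty</p>\n"
        "<p>19034,1732</p></body>trailer</html>")
text = extract_text(page)
```

Here text ends up holding the two comma-separated lines with no <p>
tags, and nothing after </body> is collected.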



