Q: htmllib

Robert Roy rjroy at takingcontrol.com
Wed Oct 25 15:44:57 EDT 2000


On Mon, 23 Oct 2000 23:04:50 -0500, "Hwanjo Yu" <hwanjoyu at uiuc.edu>
wrote:

>Please, can anyone show me an example code of parsing a html document using
>htmllib to extract all the out-links of html document ?
>Thank you in advance.
>
>
>
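Use sgmllib instead. It calls a start_a() method for every <a> tag, so
you only have to pick the href out of the attribute list.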
Bob

#findlinks.py
from sgmllib import SGMLParser

class MySGMLParser(SGMLParser):
    # sgmllib calls start_a() for every opening <a> tag;
    # attrs is a list of (name, value) pairs.
    def start_a(self, attrs):
        for name, value in attrs:
            if name.lower() == 'href':
                print value


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print "usage: python findlinks.py infile"
        raise SystemExit
    infile = sys.argv[1]
    # this is a one shot parser
    p = MySGMLParser()
    p.feed(open(infile).read())
    p.close()
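
The script above is Python 2 only: both sgmllib and htmllib were removed
in Python 3.0. For anyone on Python 3, here is a rough sketch of the same
idea using the standard html.parser module. The names LinkCollector and
find_links are mine, made up for illustration, and it collects the links
in a list instead of printing them.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>.
            if name == "href" and value is not None:
                self.links.append(value)


def find_links(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

To get the same behaviour as findlinks.py, read the input file into a
string, pass it to find_links(), and print each item of the result.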
        




