Q: htmllib
Robert Roy
rjroy at takingcontrol.com
Wed Oct 25 15:44:57 EDT 2000
On Mon, 23 Oct 2000 23:04:50 -0500, "Hwanjo Yu" <hwanjoyu at uiuc.edu>
wrote:
>Please, can anyone show me an example code of parsing a html document using
>htmllib to extract all the out-links of html document ?
>Thank you in advance.
>
>
>
Use sgmllib instead.
Bob
#findlinks.py
from sgmllib import SGMLParser
import string

class MySGMLParser(SGMLParser):
    def start_a(self, attrs):
        for attr in attrs:
            if attr[0].lower() == 'href':
                print attr[1]

    def close(self):
        SGMLParser.close(self)

if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print "usage: python findlinks.py infile"
        raise SystemExit
    infile = sys.argv[1]
    # this is a one shot parser
    p = MySGMLParser()
    p.feed(open(infile).read())
    p.close()
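
The script above targets Python 2; sgmllib was removed in Python 3. For readers on Python 3, roughly the same approach can be sketched with html.parser.HTMLParser from the standard library. The names LinkCollector and extract_links are illustrative and do not come from the original post, and this version collects the links in a list instead of printing them as it goes.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

To use it on a file, something like `extract_links(open(infile).read())` returns the list of out-links, which can then be printed one per line as in the original script.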