Parsing links within a html file.

Hai Vu wuhrrr at gmail.com
Thu Jan 17 02:48:51 EST 2008


On Jan 14, 9:59 am, Shriphani <shripha... at gmail.com> wrote:
> Hello,
> I have a html file over here by the name guide_ind.html and it
> contains links to other html files like guides.html#outline . How do I
> point BeautifulSoup (I want to use this module) to
> guides.html#outline ?
> Thanks
> Shriphani P.

Try Mark Pilgrim's excellent example at:
http://www.diveintopython.org/http_web_services/index.html

From the above link, you can retrieve openanything.py, which I use in
my example:

# list_url.py
# created by Hai Vu on 1/16/2008

from openanything import fetch
from sgmllib import SGMLParser

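class RetrieveURLs(SGMLParser):
    """Collects the href of every <a> tag fed to the parser."""
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attributes):
        # Called by SGMLParser for each <a> tag; attributes is a list of
        # (name, value) pairs.
        url = [v for k, v in attributes if k.lower() == 'href']
        self.urls.extend(url)
        print '\t%s' % (url)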

# ----------------------------------------------------------------------
# main
def main():
    site = 'http://www.google.com'

    result = fetch(site)
    if result['status'] == 200:
        # Extracts a list of URLs off the top page
        parser = RetrieveURLs()
        parser.feed(result['data'])
        parser.close()

        # Display the URLs we just retrieved
        print '\nURL retrieved from %s' % (site)
        print '\t' + '\n\t'.join(parser.urls)
    else:
        print 'Error (%d) retrieving %s' % (result['status'], site)

if __name__ == '__main__':
    main()
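
The script above is Python 2 and relies on sgmllib, which is no longer
available in Python 3. As a rough sketch only, the same idea can be
written with the standard library's html.parser, and urllib.parse can
split a link such as guides.html#outline into the file to fetch and the
anchor inside it. The names LinkCollector and extract_links, the sample
markup, and the example.com base URL are assumptions made up for
illustration; they do not come from openanything.py or BeautifulSoup.

```python
# Python 3 sketch; all names here are illustrative assumptions.
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag and attribute names.
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href' and v)


def extract_links(html, base_url):
    """Return (absolute_url, fragment) pairs for each link in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    # urljoin resolves relative links against the page they came from;
    # urldefrag separates the '#outline' part from the file to fetch.
    return [tuple(urldefrag(urljoin(base_url, u))) for u in parser.urls]


# Hypothetical stand-in for the contents of guide_ind.html
sample = '<p><a HREF="guides.html#outline">Outline</a> <a name="top">x</a></p>'
links = extract_links(sample, 'http://example.com/docs/guide_ind.html')
print(links)
```

Each pair gives the page to download and the anchor name to look for
once that page has been parsed, so the fragment never has to be sent to
the server.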


