Parsing an HTML a tag

George Sakkis gsakkis at rutgers.edu
Sat Sep 24 21:06:48 EDT 2005


"George" <buffer_88 at hotmail.com> wrote:

> I'm very new to Python and I have tried to read the tutorials, but I am
> unable to understand exactly how to do this problem.
>
> Specifically, the showIPnums function takes a URL as input, calls the
> read_page(url) function to obtain the entire page for that URL, and
> then lists, in sorted order, the IP addresses implied in the
> "<A HREF=...>" tags within that page.
>
>
> """
> Module to print IP addresses of <A> tags in a web file containing HTML
>
> >>> showIPnums('http://22c118.cs.uiowa.edu/uploads/easy.html')
> ['0.0.0.0', '128.255.44.134', '128.255.45.54']
>
> >>> showIPnums('http://22c118.cs.uiowa.edu/uploads/pytorg.html')
> ['0.0.0.0', '128.255.135.49', '128.255.244.57', '128.255.30.11', '128.255.34.132', '128.255.44.51', '128.255.45.53', '128.255.45.54', '129.255.241.42', '64.202.167.129']
>
> """
>
> def read_page(url):
>  import formatter
>  import htmllib
>  import urllib
>
>  htmlp = htmllib.HTMLParser(formatter.NullFormatter())
>  htmlp.feed(urllib.urlopen(url).read())
>  htmlp.close()
>
> def showIPnums(URL):
>  page=read_page(URL)
>
> if __name__ == '__main__':
>  import doctest, sys
>  doctest.testmod(sys.modules[__name__])


You forgot to mention that you don't want duplicates in the result. Also, as posted,
read_page never returns the parsed page and showIPnums never does anything with it.
Here's a function that passes the doctest:

from urllib import urlopen
from urlparse import urlsplit
from socket import gethostbyname
from BeautifulSoup import BeautifulSoup

def showIPnums(url):
    """Return the unique IPs found in the anchors of the webpage at the given
    url.

    >>> showIPnums('http://22c118.cs.uiowa.edu/uploads/easy.html')
    ['0.0.0.0', '128.255.44.134', '128.255.45.54']
    >>> showIPnums('http://22c118.cs.uiowa.edu/uploads/pytorg.html')
    ['0.0.0.0', '128.255.135.49', '128.255.244.57', '128.255.30.11', '128.255.34.132', '128.255.44.51', '128.255.45.53', '128.255.45.54', '129.255.241.42', '64.202.167.129']
    """
    hrefs = set()
    # fetch('a') returns every <a> tag in the page
    for link in BeautifulSoup(urlopen(url)).fetch('a'):
        try:
            # host part of the href, resolved to a dotted-quad IP
            hrefs.add(gethostbyname(urlsplit(link["href"])[1]))
        except Exception:
            # skip anchors without an href or with unresolvable hosts
            pass
    return sorted(hrefs)
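
If you'd rather stay closer to the htmllib approach you started with, note that
htmllib.HTMLParser already collects the target of every <A HREF=...> it sees in
its anchorlist attribute, so read_page only needs to return that list. A rough,
untested sketch along those lines (same idea: unique IPs, sorted):

import formatter
import htmllib
import urllib
from socket import gethostbyname
from urlparse import urlsplit

def read_page(url):
    # Parse the page with a NullFormatter; the parser records every
    # <A HREF=...> target in its anchorlist attribute.
    htmlp = htmllib.HTMLParser(formatter.NullFormatter())
    htmlp.feed(urllib.urlopen(url).read())
    htmlp.close()
    return htmlp.anchorlist

def showIPnums(url):
    ips = set()
    for href in read_page(url):
        try:
            # urlsplit(href)[1] is the host part; resolve it to an IP
            ips.add(gethostbyname(urlsplit(href)[1]))
        except Exception:
            # skip hrefs with no host or hosts that don't resolve
            pass
    return sorted(ips)

Either way the key steps are the same: pull the href out of each anchor, take the
host with urlsplit, resolve it with gethostbyname, and collect the results in a
set before sorting.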


HTH,
George




