HTML Parser - beginner needs help

Fredrik Lundh effbot at telia.com
Thu Sep 14 15:40:29 EDT 2000


"zet" wrote:
> Can somebody provide a small piece of code that returns a list of img tags?
> I've been trying these lines:
> 
> class IMGParser(HTMLParser):
>  def end_img(arg):
>   return

if you're looking for tags, sgmllib is usually easier to use.
here's an example:

# extract image tags
# (based on sgmllib-example-1.py from the eff-bot guide)

import sgmllib

class ImageParser(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.images = []

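    # img is an empty element (no end tag), so sgmllib dispatches it
    # to do_img rather than to a start_img/end_img pair
    def do_img(self, attrs):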
        for k, v in attrs:
            if k == "src":
                self.images.append(v)
                break

def extract(file):
    # collect the src attribute of every img tag in an HTML/SGML stream
    p = ImageParser()
    while 1:
        s = file.read(1024)
        if not s:
            break
        p.feed(s)
    p.close()
    return p.images

#
# try it out

import urllib

print extract(urllib.urlopen("http://www.python.org"))

## prints:
##
## ['./pics/PyBanner011.gif',
##  './pics/PythonPoweredSmall.gif',
##  'pics/pythonHi.gif']
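
for readers on Python 3, where sgmllib is no longer in the standard library, roughly the same approach can be sketched with html.parser instead. this is only a minimal sketch, not part of the original example: the chunk_size parameter and the in-memory sample document are assumptions made for illustration, and the file argument is assumed to be a text-mode file-like object.

```python
import io
from html.parser import HTMLParser

class ImageParser(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; self-closing
        # <img ... /> tags also arrive here via handle_startendtag
        if tag == "img":
            for k, v in attrs:
                if k == "src" and v is not None:
                    self.images.append(v)
                    break

def extract(file, chunk_size=1024):
    # feed the stream in chunks; the parser buffers tags that are
    # split across chunk boundaries (chunk_size is an assumed knob)
    p = ImageParser()
    while True:
        s = file.read(chunk_size)
        if not s:
            break
        p.feed(s)
    p.close()
    return p.images

# hypothetical sample document, read in deliberately small chunks
sample = io.StringIO('<p><img src="a.gif" alt="x"><IMG SRC="pics/b.gif" /></p>')
print(extract(sample, chunk_size=8))
```

the small chunk size is only there to show that tags split across reads are still picked up; for a real file or HTTP response decoded to text, the default should do.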

</F>

<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->



