regular expressions: grabbing variables from multiple matches

Thu Jan 4 06:07:21 EST 2001

Heather Lynn White wrote:
> Suppose I have a regular expression to grab all variations on a meta tag,
> and I will want to extract from any matches the name and content values
> for this tag.
>
> I use the following re

alex has already explained how to use the optional "pos"
argument of a compiled pattern's search method to continue
searching from the end of the last match.
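
for reference, here's a minimal sketch of that approach. the
pattern and the find_meta name are made up for illustration (the
original expression isn't shown above), and the pattern only
handles double-quoted name/content attributes in that order:

```python
import re

# hypothetical, deliberately simple pattern: it only matches
# name="..." followed by content="..." (double quotes only)
META = re.compile(r'<meta\s+name="([^"]*)"\s+content="([^"]*)"', re.IGNORECASE)

def find_meta(text):
    # collect (name, content) pairs by restarting each search
    # at the position where the previous match ended
    result = []
    pos = 0
    while True:
        m = META.search(text, pos)
        if m is None:
            break
        result.append(m.groups())
        pos = m.end()
    return result
```

in later Python versions, META.finditer(text) runs the same kind
of loop for you.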

but supposing you really are out to extract meta tags from an
HTML document, it might be a better idea to use the HTML/SGML
parser in sgmllib:

# extract meta tags from an HTML document
# (based on sgmllib-example-1 in the effbot guide)

import sgmllib

class ExtractMeta(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.meta = []

    def do_meta(self, attrs):
        name = content = None
        for k, v in attrs:
            if k == "name":
                name = v
            if k == "content":
                content = v
        if name and content:
            self.meta.append((name, content))

    def end_title(self):
        # ignore meta tags after </title>.  you
        # can comment out this method if you
        # want to parse the entire file
        raise EOFError

def getmeta(file):
    # extract meta tags from an HTML/SGML stream
    p = ExtractMeta()
    try:
        p.feed(file.read())
        p.close()
    except EOFError:
        pass
    return p.meta

#
# try it out

import urllib
print getmeta(urllib.urlopen("http://www.python.org"))
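
if you're on Python 3, note that sgmllib is no longer part of the
standard library. a rough equivalent can be sketched with
html.parser instead. this is an adaptation, not the code above:
the StopParsing exception is a made-up helper, and getmeta takes a
string here rather than a file object:

```python
from html.parser import HTMLParser

class StopParsing(Exception):
    # hypothetical helper, raised to stop parsing early
    pass

class ExtractMeta(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive in lowercase
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and content:
            self.meta.append((name, content))

    def handle_endtag(self, tag):
        # ignore meta tags after </title>, like end_title above
        if tag == "title":
            raise StopParsing

def getmeta(text):
    # extract meta tags from a string of HTML
    p = ExtractMeta()
    try:
        p.feed(text)
        p.close()
    except StopParsing:
        pass
    return p.meta
```

as with the sgmllib version, drop the handle_endtag method if you
want to parse the entire document.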

Hope this helps!

Cheers /F

<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->