regular expressions: grabbing variables from multiple matches
Fredrik Lundh
fredrik at effbot.org
Thu Jan 4 06:07:21 EST 2001
Heather Lynn White wrote:
> Suppose I have a regular expression to grab all variations on a meta tag,
> and I will want to extract from any matches the name and content values
> for this tag.
>
> I use the following re
alex has already explained how to use the optional "pos"
argument to search forward from the last match.
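
for reference, a minimal sketch of that "pos" approach, written for Python 3. the pattern here is a simplified stand-in (an assumption, not the expression from the original post), and find_meta and sample are hypothetical names used only for illustration:

```python
import re

# Hypothetical, simplified pattern (NOT the one from the original post):
# it only matches <meta name="..." content="..."> with the attributes
# in exactly that order and with double-quoted values.
META = re.compile(
    r'<meta\s+name="([^"]*)"\s+content="([^"]*)"\s*/?>',
    re.IGNORECASE,
)

def find_meta(text):
    """Collect (name, content) pairs, restarting each search where the last match ended."""
    found = []
    pos = 0
    while True:
        m = META.search(text, pos)   # optional "pos" argument: start offset
        if m is None:
            break
        found.append(m.groups())     # (name, content) for this match
        pos = m.end()                # continue after the match just found
    return found

sample = '<meta name="author" content="F"><META NAME="keywords" CONTENT="re, sgmllib">'
print(find_meta(sample))
```

a pattern this strict misses reordered attributes, single quotes, and unquoted values, which is part of why a real parser tends to be the safer choice.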
but supposing you really are out to extract meta tags from an
HTML document, it might be a better idea to use the HTML/SGML
parser in sgmllib:
# extract meta tags from an HTML document
# (based on sgmllib-example-1 in the effbot guide)

import sgmllib

class ExtractMeta(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.meta = []

    def do_meta(self, attrs):
        # attrs is a list of (attribute, value) pairs
        name = content = None
        for k, v in attrs:
            if k == "name":
                name = v
            if k == "content":
                content = v
        if name and content:
            self.meta.append((name, content))

    def end_title(self):
        # ignore meta tags after </title>.  you
        # can comment out this method if you
        # want to parse the entire file
        raise EOFError

def getmeta(file):
    # extract meta tags from an HTML/SGML stream
    p = ExtractMeta()
    try:
        p.feed(file.read())
        p.close()
    except EOFError:
        pass
    return p.meta

#
# try it out

import urllib
print getmeta(urllib.urlopen("http://www.python.org"))
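
the listing above is Python 2, and sgmllib was removed in Python 3. a rough equivalent can be sketched with html.parser.HTMLParser from the standard library instead (a swapped-in parser, not the one used above). StopParsing and the sample string are made up for the example, and this getmeta takes a string rather than a file object:

```python
from html.parser import HTMLParser

class StopParsing(Exception):
    """Hypothetical helper: raised to abandon parsing early (the role EOFError plays above)."""

class ExtractMeta(HTMLParser):

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs
        if tag != "meta":
            return
        d = dict(attrs)
        name, content = d.get("name"), d.get("content")
        if name and content:
            self.meta.append((name, content))

    def handle_endtag(self, tag):
        # ignore meta tags after </title>; remove this method
        # to parse the entire document
        if tag == "title":
            raise StopParsing

def getmeta(text):
    # extract meta tags from an HTML string
    p = ExtractMeta()
    try:
        p.feed(text)
        p.close()
    except StopParsing:
        pass
    return p.meta

sample = """<html><head>
<meta name="author" content="effbot">
<META CONTENT="python, sgml" NAME="keywords">
<title>demo</title>
<meta name="late" content="ignored">
</head></html>"""
print(getmeta(sample))
```

because the parser normalizes case and attribute order, the second tag is picked up without any extra work, and the tag after </title> is skipped.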
Hope this helps!
Cheers /F
<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->