Help on regular expression match
Fredrik Lundh
fredrik at pythonware.com
Fri Sep 23 02:35:30 EDT 2005
Johnny Lee wrote:
> I've run into a problem matching a regular expression in Python. I hope
> one of you can help me. Here are the details:
>
> I have many tags like this:
> xxx<a href="http://xxx.xxx.xxx" xxx>xxx
> xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
> xxx<a href="http://xxx.xxx.xxx" xxx>xxx
> .....
> And I want to extract all the "http://xxx.xxx.xxx" URLs, so I do it
> like this:
> httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
> result = httpPat.findall(data)
> I use this to observe my output:
> for i in result:
>     print i[2]
> Surprisingly I will get some output like this:
> http://xxx.xxx.xxx">xxx</a>xxx
> In fact it's filtered from this kind of source:
> <a href="http://xxx.xxx.xxx">xxx</a>xxx"
> But some results are right. How can I get all the
> answers clean, like "http://xxx.xxx.xxx"? Thanks for your help.
".*" gives the longest possible match (you can think of it as searching back-
wards from the right end). if you want to search for "everything until a given
character", searching for "[^x]*x" is often a better choice than ".*x".
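A minimal sketch of the difference, assuming a made-up sample line (the `data` string and the variable names are illustrative only, and the code uses Python 3 `print()` syntax rather than the Python 2 syntax used elsewhere in this thread):

```python
import re

# Hypothetical sample line with two quoted attributes on the same tag.
data = '<a href="http://xxx.xxx.xxx" class="ext">xxx</a>'

# Greedy: ".*" first runs to the end of the line, then backs up
# until it finds a quote, so it stops at the LAST quote on the line.
greedy = re.findall(r'href="(.*)"', data)

# Negated class: '[^"]*' cannot cross a quote, so the match
# ends at the FIRST quote after the URL.
bounded = re.findall(r'href="([^"]*)"', data)

print(greedy)   # ['http://xxx.xxx.xxx" class="ext']
print(bounded)  # ['http://xxx.xxx.xxx']
```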
in this case, I suggest using something like
print re.findall("href=\"([^\"]+)\"", text)
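The pattern above returns every href, including the "wap://" ones. If only the "http://" links are wanted, as in the original question, the prefix can go inside the group. This is a sketch under assumptions: the `text` sample is invented to mirror the tags in the question, and the code uses Python 3 syntax.

```python
import re

# Hypothetical input modelled on the tags quoted in the question.
text = '''xxx<a href="http://xxx.xxx.xxx" xxx>xxx
xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
xxx<a href="http://yyy.yyy.yyy">xxx</a>xxx'''

# Require the "http://" prefix, then take everything up to the next quote.
http_links = re.findall(r'href="(http://[^"]+)"', text)

print(http_links)  # ['http://xxx.xxx.xxx', 'http://yyy.yyy.yyy']
```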
or, if you're going to parse HTML pages from many different sources, a
real parser:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    print value

p = MyHTMLParser()
p.feed(text)
p.close()
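On Python 3 the same module lives at `html.parser`. The following is a rough equivalent, not part of the original reply: the class name `LinkCollector`, the `links` attribute, the "http://" filter, and the sample `text` are all assumptions made for illustration. It collects the values in a list instead of printing them.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with "http://"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for key, value in attrs:
                # value is None for attributes written without a value.
                if key == "href" and value and value.startswith("http://"):
                    self.links.append(value)

# Hypothetical input with mixed case and mixed quoting styles.
text = ('<p>xxx<a href="http://xxx.xxx.xxx" title="t">xxx</a> '
        '<a href="wap://xxx.xxx.xxx">xxx</a> '
        "<A HREF='http://yyy.yyy.yyy'>xxx</A></p>")

p = LinkCollector()
p.feed(text)
p.close()

print(p.links)  # ['http://xxx.xxx.xxx', 'http://yyy.yyy.yyy']
```

Unlike the regular expression, the parser copes with single-quoted attributes and upper-case tag names without any extra work.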
see:
http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/htmlparser-example.html
http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
</F>