Help on regular expression match

Fri Sep 23 02:35:30 EDT 2005

Johnny Lee wrote:

>   I've run into a problem matching a regular expression in Python.
> I hope one of you can help me. Here are the details:
>
>   I have many tags like this:
>      xxx<a href="http://xxx.xxx.xxx" xxx>xxx
>      xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
>      xxx<a href="http://xxx.xxx.xxx" xxx>xxx
>      .....
>   I want to extract every "http://xxx.xxx.xxx" URL, so I do it
> like this:
>      httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
>      result = httpPat.findall(data)
>   I use this to observe my output:
>      for i in result:
>         print i[2]
>   Surprisingly, I get some output like this:
>      http://xxx.xxx.xxx">xxx</a>xxx
>   It comes from this kind of source:
>      <a href="http://xxx.xxx.xxx">xxx</a>xxx"
>   Some results are right, though. How can I get all the results
> clean, like "http://xxx.xxx.xxx"? Thanks for your help.

".*" is greedy: it gives the longest possible match (you can think of it as
searching backwards from the right end).  if you want to search for
"everything until a given character", searching for "[^x]*x" is often a
better choice than ".*x".
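
to see the difference, here's a small sketch.  the sample line is made up,
modelled on the source line quoted above, and the variable names are only
for illustration:

```python
import re

# hypothetical sample line, modelled on the one quoted in the question;
# note the stray quote at the very end of the line
line = '<a href="http://xxx.xxx.xxx">xxx</a>xxx"'

# greedy: ".*" runs to the end of the line, then backs up to the LAST quote
greedy = re.findall(r'href="(http://.*)"', line)

# negated character class: stops at the FIRST quote after the URL
bounded = re.findall(r'href="(http://[^"]*)"', line)

print(greedy)   # ['http://xxx.xxx.xxx">xxx</a>xxx']
print(bounded)  # ['http://xxx.xxx.xxx']
```

the first pattern swallows everything up to the last quote on the line, which
is the output described in the question; the second cannot cross a quote, so
it stops where the attribute value ends.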

in this case, I suggest using something like

    print re.findall("href=\"([^\"]+)\"", text)

or, if you're going to parse HTML pages from many different sources, a
real parser:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for key, value in attrs:
                    if key == "href":
                        print value

    p = MyHTMLParser()
    p.feed(text)
    p.close()
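
on Python 3 the same class lives in the html.parser module.  a minimal
sketch, assuming you want the links collected in a list and only the http://
ones kept; the LinkCollector name, the links attribute and the sample text
are made up for illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags that start with http://."""

    def __init__(self):
        super().__init__()
        self.links = []  # hypothetical attribute, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                # value is None for attributes written without a value
                if key == "href" and value and value.startswith("http://"):
                    self.links.append(value)

# hypothetical sample input with one http link and one wap link
text = ('xxx<a href="http://a.example" class="x">xxx</a>'
        'xxx<a href="wap://b.example">xxx</a>xxx')

p = LinkCollector()
p.feed(text)
p.close()
print(p.links)  # ['http://a.example']
```

collecting into a list instead of printing makes the result easier to reuse;
drop the startswith check if you want every href, as in the version above.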

see:

    http://docs.python.org/lib/module-HTMLParser.html
    http://docs.python.org/lib/htmlparser-example.html
    http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html

</F>