Help on regular expression match

Johnny Lee johnnyandfiona at hotmail.com
Fri Sep 23 03:09:17 EDT 2005


Fredrik Lundh wrote:
> ".*" gives the longest possible match (you can think of it as searching back-
> wards from the right end).  if you want to search for "everything until a given
> character", searching for "[^x]*x" is often a better choice than ".*x".
>
> in this case, I suggest using something like
>
>     print re.findall("href=\"([^\"]+)\"", text)
>
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
>
>     from HTMLParser import HTMLParser
>
>     class MyHTMLParser(HTMLParser):
>
>         def handle_starttag(self, tag, attrs):
>             if tag == "a":
>                 for key, value in attrs:
>                     if key == "href":
>                         print value
>
>     p = MyHTMLParser()
>     p.feed(text)
>     p.close()
>
> see:
>
>     http://docs.python.org/lib/module-HTMLParser.html
>     http://docs.python.org/lib/htmlparser-example.html
>     http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
>
> </F>

Thanks for your help.
I found another solution: simply adding a '?' after ".*", which makes
it non-greedy, so it matches the shortest possible string that still
satisfies the regular expression.
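
A small sketch of the difference between the greedy, non-greedy and
negated-character-class patterns. The sample string is made up for
illustration and is not taken from the real page:

```python
import re

# Hypothetical sample text, made up for illustration.
text = '<a href="a.html">A</a> <a href="b.html">B</a>'

# Greedy: ".*" runs to the LAST double quote in the text.
greedy = re.findall(r'href="(.*)"', text)

# Non-greedy: ".*?" stops at the first double quote that lets the match succeed.
lazy = re.findall(r'href="(.*?)"', text)

# Negated character class: the group can never run past a double quote.
negated = re.findall(r'href="([^"]+)"', text)

print(greedy)
print(lazy)
print(negated)
```

On this sample the non-greedy and negated-class patterns each return
the two href values separately, while the greedy pattern returns one
long string spanning both tags.
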
As for HTMLParser, I ran into another problem (taking my code as an example):

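import urllib
import htmllib      # was missing; htmllib.HTMLParser is used below
import formatter

baseUrl = "http://www.nba.com"  # the page that triggers the error

# htmllib's parser collects every <a href=...> value in parser.anchorlist
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(urllib.urlopen(baseUrl).read())
parser.close()
for url in parser.anchorlist:
    # print absolute http:// links only
    if url.startswith("http://"):
        print url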

When baseUrl = "http://www.nba.com", an HTMLParseError is raised
because of the line "<! Copyright IBM Corporation, 2001, 2002 !>".
I found that this line is inside <script> tags; maybe that is the
cause?
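
One possible workaround, assuming the <script> block really is the
culprit, is to strip script blocks with a non-greedy regular expression
before feeding the page to a parser. The sketch below uses the
HTMLParser class from the quoted reply instead of htmllib. The names
SCRIPT_RE, LinkCollector and absolute_links and the sample page are
made up for illustration, and this has not been tried against the real
page:

```python
import re

try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # Python 3 module name

# ".*?" is non-greedy, so each match ends at the nearest closing </script>.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


class LinkCollector(HTMLParser):
    """Collects absolute http:// links found in <a href=...> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value and value.startswith("http://"):
                    self.links.append(value)


def absolute_links(html):
    """Strip <script> blocks, then return the absolute links in the page."""
    parser = LinkCollector()
    parser.feed(SCRIPT_RE.sub("", html))
    parser.close()
    return parser.links


# Hypothetical page containing the problematic line inside a script block.
sample = (
    '<html><body>'
    '<script><! Copyright IBM Corporation, 2001, 2002 !></script>'
    '<a href="http://www.nba.com/news">news</a>'
    '<a href="/local">local</a>'
    '</body></html>'
)
print(absolute_links(sample))
```

Stripping with a regular expression is a blunt tool: it would also cut
text if a script happened to contain the string "</script>" inside a
string literal, so treat it as a stopgap rather than a general fix.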



