Help on regular expression match
Johnny Lee
johnnyandfiona at hotmail.com
Fri Sep 23 03:09:17 EDT 2005
Fredrik Lundh wrote:
> ".*" gives the longest possible match (you can think of it as searching
> backwards from the right end). if you want to search for "everything until
> a given character", searching for "[^x]*x" is often a better choice than ".*x".
>
> in this case, I suggest using something like
>
> print re.findall("href=\"([^\"]+)\"", text)
>
> or, if you're going to parse HTML pages from many different sources, a
> real parser:
>
> from HTMLParser import HTMLParser
>
> class MyHTMLParser(HTMLParser):
>
>     def handle_starttag(self, tag, attrs):
>         if tag == "a":
>             for key, value in attrs:
>                 if key == "href":
>                     print value
>
> p = MyHTMLParser()
> p.feed(text)
> p.close()
>
> see:
>
> http://docs.python.org/lib/module-HTMLParser.html
> http://docs.python.org/lib/htmlparser-example.html
> http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
>
> </F>
Thanks for your help.
I found another solution: simply adding a '?' after ".*" makes the
match non-greedy, so it matches the shortest string that satisfies the
regular expression.
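To make the difference concrete, here is a minimal sketch comparing the greedy pattern, the non-greedy pattern, and the negated character class suggested above. The sample string and the variable names are made up for illustration; they are not from the page I was actually parsing.

```python
import re

# Made-up sample with two links on one line.
text = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>'

# Greedy: ".*" runs to the last double quote on the line,
# so both links end up inside a single match.
greedy = re.findall(r'href="(.*)"', text)

# Non-greedy: ".*?" stops at the first closing quote it can.
lazy = re.findall(r'href="(.*?)"', text)

# Negated class: "[^"]+" can never cross a double quote at all.
negated = re.findall(r'href="([^"]+)"', text)

print(greedy)
print(lazy)
print(negated)
```

For this input the non-greedy pattern and the negated class both return the two URLs separately, while the greedy pattern returns one long string spanning both tags.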
As for HTMLParser, I ran into another problem (taking my code as an example):
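import urllib
import htmllib
import formatter

baseUrl = "http://www.nba.com"  # the page that triggers the error
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(urllib.urlopen(baseUrl).read())
parser.close()
# print only the absolute http:// links collected by the parser
for url in parser.anchorlist:
    if url[0:7] == "http://":
        print url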
When baseUrl = "http://www.nba.com", an HTMLParseError is raised
because of the line "<! Copyright IBM Corporation, 2001, 2002 !>".
I found that this line is inside <script> tags; maybe that is the
cause?
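One way to test that guess, sketched here as an assumption rather than a confirmed fix, would be to strip the <script> blocks from the page before feeding it to the parser. The helper name strip_scripts and the sample markup below are hypothetical, and the regular expression is a rough filter, not a real HTML tokenizer.

```python
import re

# Hypothetical helper: remove everything from <script ...> to </script>.
# DOTALL lets "." cross newlines; ".*?" keeps each match to a single block.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                          re.IGNORECASE | re.DOTALL)

def strip_scripts(html):
    """Return html with all <script>...</script> blocks removed."""
    return SCRIPT_BLOCK.sub("", html)

# Made-up fragment imitating the structure described above.
sample = (
    '<a href="http://www.nba.com/news">news</a>\n'
    '<script type="text/javascript">\n'
    '<! Copyright IBM Corporation, 2001, 2002 !>\n'
    '</script>\n'
    '<a href="/local">local</a>\n'
)

cleaned = strip_scripts(sample)
print(cleaned)
```

If the parser accepts the cleaned text but still fails on the original, that would point at the contents of the <script> block as the trigger.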