[Tutor] RE help

Kent Johnson kent37 at tds.net
Tue Feb 15 21:26:28 CET 2005


Try it with non-greedy matches. With greedy (.*) and re.DOTALL, you are matching everything from the
first <hX><a to the last </p> in one match. Also, I think you want to escape the . before </p> (do you
want just paragraphs that end in a period?). Unescaped, it matches any character, including a newline.

pattern = re.compile(r"""<h[1-2]><a href="/(.*?)">(.*?)\.</p>""", re.DOTALL)
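
To see the difference, here is a small sketch that runs both patterns against a made-up page. The article markup is the sample from the original post; the trailing Boxerjam paragraph is only my guess at what the real page looks like around that link (the word "Play" and the opening <div><p> are invented), and the names page, greedy and lazy are just for illustration. It is written so it should run on current Python as well.

```python
import re

# Sample page. The first part is the article from the original post; the
# second part is an ASSUMED reconstruction of the markup around the
# Boxerjam link (the word "Play" and the opening tags are made up).
page = (
    '<h2><a href="/stories/507/5240764.html">Sid Hartman: '
    'Franchise could be moved</a></h2><p>If Reggie Fowler and his '
    'business partners from New Jersey are approved to buy the Vikings '
    'franchise from Red McCombs, it is my opinion the franchise remains '
    'in danger of eventually being relocated.</p>\n'
    '<div><p>Play <a href="http://www.startribune.com/stories/1559/'
    '4773140.html"><b>Boxerjam</b></a>.\n</p></div>'
)

# Original pattern: greedy groups and an unescaped dot before </p>.
greedy = re.compile("""<h[1-2]><a href="/(.*)">(.*).</p>""", re.DOTALL)

# Suggested pattern: non-greedy groups and a literal period before </p>.
lazy = re.compile(r"""<h[1-2]><a href="/(.*?)">(.*?)\.</p>""", re.DOTALL)

greedy_matches = greedy.findall(page)
lazy_matches = lazy.findall(page)

# Greedy: the first group runs on to the LAST '">' on the page, so the
# body group is left with only the tail after the Boxerjam link.
for headline, body in greedy_matches:
    print(body)

# Non-greedy: each group stops at the first place the rest can match,
# so the body is the headline text plus the article paragraph.
for headline, body in lazy_matches:
    print(body)
```

Note that with \. outside the second group, the final period is not part of the captured body. Move it inside the parentheses if you want to keep it.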

Kent

Ron Nixon wrote:
> Trying to scrape a newspaper site for articles using
> this code which was done with help from the list:
> 
> import urllib, re
> pattern = re.compile("""<h[1-2]><a href="/(.*)">(.*).</p>""", re.DOTALL)
> page = urllib.urlopen("http://www.startribune.com").read()
> 
> for headline, body in pattern.findall(page):
>     print body
> 
> It should grab articles from this:
> 
> <h2><a href="/stories/507/5240764.html">Sid Hartman:
> Franchise could be moved</a></h2><p>If Reggie Fowler
> and his business partners from New Jersey are approved
> to buy the Vikings franchise from Red McCombs, it is
> my opinion the franchise remains in danger of
> eventually being relocated.</p>
> 
> and give me this: Sid Hartman: Franchise could be
> moved</a></h2><p>If Reggie Fowler and his business
> partners from New Jersey are approved to buy the
> Vikings franchise from Red McCombs, it is my opinion
> the franchise remains in danger of eventually being
> relocated.
> 
> Instead it gives me this: <b>Boxerjam</b></a>.
> from this:
> href="http://www.startribune.com/stories/1559/4773140.html"><b>Boxerjam</b></a>.
> </p></div>
> 
> I know the re works in other programs I've tried. Is
> there something different about re's in Python?
> 


