[Tutor] RE help
Kent Johnson
kent37 at tds.net
Tue Feb 15 21:26:28 CET 2005
Try it with non-greedy matches. With greedy (.*) you are matching everything from the first <hX><a
to the last </p> in a single match. Also, I think you want to escape the . before </p>, so that it
matches only a literal period (you want just paragraphs that end in a period?)
pattern = re.compile(r"""<h[1-2]><a href="/(.*?)">(.*?)\.</p>""", re.DOTALL)
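
To see the difference concretely, here is a minimal sketch, written for Python 3 and run against an inline string rather than the live page. The sample markup is adapted from the quoted post; the contents of `page` (including the "Play" wording in the second paragraph) and the names `greedy` and `lazy` are assumptions made for illustration.

```python
import re

# Inline sample adapted from the quoted post. The second paragraph is a
# hypothetical stand-in for markup that appears further down the real page.
page = (
    '<h2><a href="/stories/507/5240764.html">Sid Hartman: '
    'Franchise could be moved</a></h2><p>If Reggie Fowler and his '
    'business partners from New Jersey are approved to buy the Vikings '
    'franchise from Red McCombs, it is my opinion the franchise remains '
    'in danger of eventually being relocated.</p>\n'
    '<div><p>Play <a href="http://www.startribune.com/stories/1559/4773140.html">'
    '<b>Boxerjam</b></a>.\n</p></div>'
)

# Original pattern: each greedy (.*) grabs as much as it can, so the match
# runs to the LAST '">' and the LAST '</p>' in the page.
greedy = re.compile(r'<h[1-2]><a href="/(.*)">(.*).</p>', re.DOTALL)

# Suggested pattern: non-greedy (.*?) stops at the FIRST '">' and the first
# '.</p>'. The period is escaped so it matches only a literal '.'.
lazy = re.compile(r'<h[1-2]><a href="/(.*?)">(.*?)\.</p>', re.DOTALL)

greedy_matches = greedy.findall(page)
lazy_matches = lazy.findall(page)

for headline, body in greedy_matches:
    print("greedy:", body)

for headline, body in lazy_matches:
    print("lazy:  ", body)
```

On this sample the greedy pattern prints only the Boxerjam fragment, as in the original report, while the non-greedy one prints the headline and article text. Because the escaped period sits outside the second group, the captured body stops just before the final period.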
Kent
Ron Nixon wrote:
> Trying to scrape a newspaper site for articles using
> this code, which was done with help from the list:
>
> import urllib, re
> pattern = re.compile("""<h[1-2]><a href="/(.*)">(.*).</p>""", re.DOTALL)
> page = urllib.urlopen("http://www.startribune.com").read()
>
> for headline, body in pattern.findall(page):
>     print body
>
> It should grab articles from this:
>
> <h2><a href="/stories/507/5240764.html">Sid Hartman:
> Franchise could be moved</a></h2><p>If Reggie Fowler
> and his business partners from New Jersey are approved
> to buy the Vikings franchise from Red McCombs, it is
> my opinion the franchise remains in danger of
> eventually being relocated.</p>
>
> and give me this: Sid Hartman: Franchise could be
> moved</a></h2><p>If Reggie Fowler and his business
> partners from New Jersey are approved to buy the
> Vikings franchise from Red McCombs, it is my opinion
> the franchise remains in danger of eventually being
> relocated.
>
> Instead it gives me this:
>
> <b>Boxerjam</b></a>.
>
> from this:
>
> href="http://www.startribune.com/stories/1559/4773140.html"><b>Boxerjam</b></a>.
> </p></div>
>
> I know the re works in other programs I've tried. Is
> there something different about re's in Python?