why does this call to re.findall() loop forever?

Sun Nov 9 18:00:05 EST 2008

Hi everyone,

I am using Python's re module to extract some data from HTML. The
following code never returns, and I was wondering if someone could
explain why. Is this a problem with my regexp? I tried really hard to
find one but could not.

The string contains three records (list items in an HTML page). Notice
that NONE of them matches the regexp: these records do not contain the
"title" attribute which the regexp expects inside the '<span
class="date">' tag.

The weird thing is that removing any of the three records makes
findall() immediately return an empty list, while if I pass all three
records to findall() it never returns. Why does this happen?

This is with Python 2.6.

Thanks so much for any help

-james

s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
 <h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow">Sluggy Freelance</a>
 </h4>
 <div class="commands">  <a save href="/post?url=http%3A%2F
%2Fwww.sluggy.com%2F&title=Sluggy
%20Freelance&copyuser=crowebert&copytags=imported%2BRSS
%2BComics%2Bhumor%2Bdaily%2Bwebcomics&jump=no&partner=del"
class="copy" rel="nofollow">save this</a></div> <div class="meta">to
<a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
class="date">1945-07-18</span> </div>
</li>

<li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
 <h4 class="desc"><a href="http://www.snackbar-games.com/
gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
Project</a>
 </h4>
 <div class="commands">  <a save href="/post?url=http%3A%2F
%2Fwww.snackbar-games.com%2Fgbacovers.php&title=Snackbar-Games.com
%20%3A%3A%20GBA%20DS%20Cover
%20Project&copyuser=crowebert&copytags=imported%2BBookmarkMenu
%2BGameStuff%2Bart%2BGBA%2Bgames
%2Bnintendo&jump=no&partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
class="date">1948-12-31</span> </div>
</li>

<li class="post" key="690ace1f465ae419dee8145ad3871024">
 <h4 class="desc"><a href="http://www.megatokyo.com/"
rel="nofollow">MegaTokyo</a>
 </h4>
 <div class="commands">  <a save href="/post?url=http%3A%2F
%2Fwww.megatokyo.com
%2F&title=MegaTokyo&copyuser=crowebert&copytags=imported
%2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
%2Bwebcomics&jump=no&partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
class="date">1946-01-28</span> </div>
</li>"""

import re

# Adjacent raw-string pieces are concatenated into a single pattern.
regexp = re.compile(
    r'<li class="post".*?<h4 class="desc">'
    r'<a href="(.*?)" rel="nofollow">(.*?)</a>.*?</div>\s*'
    r'(?:<p class="notes">(.*?)</p>)?.*?<div class="meta">'
    r'(?:to ((?:<a class="tag".*?> )+))*.*?'
    r'<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
    re.DOTALL)

re.findall(regexp, s)
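
One way to probe this kind of behaviour on a smaller scale is to time a
failing match against inputs of growing length. The sketch below does not
use the regexp above: it uses a hypothetical toy pattern, (a+)+$, which
only shares one feature with it, namely a repeated group whose body is
itself a repeat (compare the "(?:to ((?:...)+))*" part). Whether that
feature is what makes the call above hang is an assumption, not something
established here. The helper name time_failed_match is made up for
illustration, and the code is written for Python 3 (time.perf_counter is
not available in 2.6).

```python
import re
import time

# Hypothetical toy pattern, not the one from the post: a repeated group
# whose body is itself a repeat, followed by an anchor that cannot match.
TOY = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return (match result, seconds) for TOY against n 'a's plus a 'b'."""
    text = "a" * n + "b"  # the trailing 'b' guarantees that the match fails
    start = time.perf_counter()
    result = TOY.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the time taken can grow very quickly with n.
    for n in (8, 12, 16, 18):
        result, seconds = time_failed_match(n)
        print(n, result, "%.4f s" % seconds)
```

If the printed times climb steeply as n grows while the result stays
None, the engine is spending its time trying alternative ways to split
the input between the inner and outer repeats before giving up.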