why does this call to re.findall() loop forever?

Mon Nov 10 07:29:59 EST 2008

james.kirin40 at gmail.com <james.kirin40 at gmail.com> wrote:
>  My apologies, given that Google Groups  messes up the formatting, the
>  regexp should read
> 
>  regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"><a
>  href=
>  \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
>  \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?> )
>  +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
>  li>""", re.DOTALL)

Some regular expressions can't be matched in a reasonable length of
time, because a backtracking engine like Python's re can need
exponentially many steps on them.  Not sure whether this is your
problem, but it might be!  Search for "exponential time regular
expression" if you want some examples.

E.g. http://bugs.python.org/issue1515829
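
To give a feel for what "exponential time" means here, this is a
minimal sketch using the textbook pathological pattern (a+)+$ rather
than your regexp.  The pattern, the helper name time_failed_match and
the input sizes are my own choices for illustration.  The match fails
only after the engine has tried a very large number of ways to split
the run of "a"s between the inner and outer +, so each extra character
roughly doubles the work.

```python
import re
import time

# Textbook pathological pattern (an illustration, not the regexp from
# the original post): nested quantifiers followed by an anchor that a
# trailing "b" can never satisfy.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Match PATTERN against n 'a's plus one 'b'; return (match, seconds)."""
    text = "a" * n + "b"
    start = time.perf_counter()
    match = PATTERN.match(text)  # fails, but only after heavy backtracking
    return match, time.perf_counter() - start


# Input sizes kept small on purpose; a few more characters and the
# call looks as if it has hung.
for n in (10, 14, 18):
    match, seconds = time_failed_match(n)
    print(n, match, "%.4f s" % seconds)
```

Your regexp has several .*? runs plus a nested group of the form
(?:...(?:...)+)*, which is the sort of shape that can behave like this
when a page does not quite match, though I have not timed it.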

I'd probably attack this problem using BeautifulSoup rather than
regexps!
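
Here is a rough sketch of the parser-based approach.  To keep it
runnable without installing anything it uses the standard library's
html.parser as a stand-in for BeautifulSoup, which would make the same
traversal shorter.  The SAMPLE markup is made up to mirror the
structure your regexp implies, and the names PostParser and
parse_posts are hypothetical; the real pages may well differ.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect link, title, notes, tags and date from each <li class="post">."""

    def __init__(self):
        super().__init__()
        self.posts = []        # finished posts, one dict per <li class="post">
        self.current = None    # post being built, or None outside a post
        self.field = None      # text field currently being filled, if any
        self.in_desc = False   # True between <h4 class="desc"> and </h4>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        if tag == "li" and cls == "post":
            self.current = {"href": None, "title": "", "notes": "",
                            "tags": [], "date": None}
        elif self.current is None:
            return
        elif tag == "h4" and cls == "desc":
            self.in_desc = True
        elif tag == "a" and self.in_desc:
            self.current["href"] = attrs.get("href")
            self.field = "title"
        elif tag == "p" and cls == "notes":
            self.field = "notes"
        elif tag == "a" and cls == "tag":
            self.current["tags"].append("")
            self.field = "tag"
        elif tag == "span" and cls == "date":
            self.current["date"] = attrs.get("title")

    def handle_data(self, data):
        if self.current is None or self.field is None:
            return
        if self.field == "tag":
            self.current["tags"][-1] += data
        else:
            self.current[self.field] += data

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag in ("a", "p"):
            self.field = None
        elif tag == "h4":
            self.in_desc = False
        elif tag == "li":
            # Assumes posts contain no nested <li>; good enough for a sketch.
            self.posts.append(self.current)
            self.current = None


# Made-up markup shaped like what the regexp expects (an assumption).
SAMPLE = """
<ul>
<li class="post">
  <h4 class="desc"><a href="http://example.com/" rel="nofollow">Example site</a></h4>
  <div></div>
  <p class="notes">A made-up note</p>
  <div class="meta">to <a class="tag" href="/tag/python">python</a> <a class="tag" href="/tag/regex">regex</a>
    <span class="date" title="2008-11-10">Nov 08</span>
  </div>
</li>
<li class="post">
  <h4 class="desc"><a href="http://example.org/" rel="nofollow">No notes here</a></h4>
  <div class="meta"><span class="date" title="2008-11-09">Nov 08</span></div>
</li>
</ul>
"""


def parse_posts(html):
    """Feed html through PostParser and return the list of post dicts."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


for post in parse_posts(SAMPLE):
    print(post)
```

The parser makes a single pass over the input, so a post with no notes
or no tags just leaves those fields empty instead of sending the
matcher off to backtrack.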

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick