Regex recursion error example.

Fri Nov 1 11:12:59 EST 2002

Yin wrote:
> 
> After tinkering with this issue for a day or so, I've decided to use
> xmllib to solve the problem.  But for future reference, I've attached
> the piece of text that is failing and the two approaches that I've
> tried to make the match.
> 
> Of course there are other approaches to doing this parse, but I am
> interested in understanding the regex approach I am trying and its
> limitations.
> 
> If there are no solutions using regex, I would be interested in seeing
> a reference to articles or books that discuss overcoming particularly
> long string matches.
> 
> Approach 1:
> pattern=re.compile('<PubMedArticle>(.*?)</PubMedArticle>',
> re.DOTALL)
> self.citationlist = re.findall(pattern, allinput)
> 
> Approach 2:
> comppat=re.compile(
>     r'<PubMedArticle>((?:(?!<PubMedArticle>).)*)</PubMedArticle>',
>     re.DOTALL)
> self.citationlist = re.findall(comppat, allinput)
> 
> There are three matches to make in this body of text.  The above code
> has been failing on the second of the three.  This problem has only
> been occurring on Linux Python and not on Windows Python (the stack on
> Windows is just large enough to accommodate the matches).
> Text to match:
> 
> http://160.129.203.97/1998_xmltest.html
> 
> Please let me know by e-mail if the link is down.
> 
> Thanks again,
> Yin
> -- 

How about this (untested); don't think you will get a recursion problem:

pattern=re.compile("""
   <PubMedArticle>
   (?:[^<]+
   |
   <(?!/PubMedArticle>)
   )*
   </PubMedArticle>"""
   , re.DOTALL | re.VERBOSE)
self.citationlist = pattern.findall(allinput)
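
For reference, here is a minimal self-contained sketch of the same idea.
It has not been run against the original data, and it assumes the tags
appear literally as <PubMedArticle> in the input.  The names ARTICLE_RE
and extract_articles and the sample string are made up for illustration.
It uses the "unrolled" variant [^<]*(?:<(?!/PubMedArticle>)[^<]*)* so
that each run of text can be consumed in only one way, and it adds a
capturing group so that findall returns just the article bodies:

```python
import re

# Unrolled-loop pattern: take a run of non-"<" characters in one step,
# then repeatedly accept a "<" that does not begin the closing tag,
# followed by another run of non-"<" characters.
ARTICLE_RE = re.compile(
    r"""
    <PubMedArticle>
    (                               # group 1: the article body
        [^<]*                       # leading run of ordinary characters
        (?: <(?!/PubMedArticle>)    # a "<" that is not the closing tag
            [^<]*                   # followed by more ordinary characters
        )*
    )
    </PubMedArticle>
    """,
    re.VERBOSE,
)


def extract_articles(text):
    """Return the body of every <PubMedArticle> element found in text."""
    return ARTICLE_RE.findall(text)


# Made-up sample input, not the data from the original post.
sample = (
    "<PubMedArticle><PMID>1</PMID></PubMedArticle>\n"
    "<PubMedArticle><PMID>2</PMID>\n<Title>x</Title></PubMedArticle>"
)
print(extract_articles(sample))
```

re.DOTALL is left out here because the pattern contains no "."; the
negated class [^<] already matches newlines.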

Harvey
