Regex recursion error example.
Bengt Richter
bokr at oz.net
Fri Nov 1 18:59:39 EST 2002
On 1 Nov 2002 07:13:20 -0800, yin_12180 at yahoo.com (Yin) wrote:
>After tinkering with this issue for a day or so, I've decided to use
>xmllib to solve the problem. But for future reference, I've attached
>the piece of text that is failing and the two approaches that I've
>tried to make the match.
>
>Of course there are other approaches to doing this parse, but I am
>interested in understanding the regex approach I am trying and its
>limitations.
>
>If there are no solutions using regex, I would be interested in seeing
>a reference to articles or books that discuss overcoming particularly
>long string matches.
>
>Approach 1:
>pattern=re.compile('<PubMedArticle>(.*?)</PubMedArticle>',
>re.DOTALL)
>self.citationlist = re.findall(pattern, allinput)
>
>Approach 2:
>comppat=re.compile(r'<PubMedArticle>((?:(?!<PubMedArticle>).)*)</PubMedArticle>',
>re.DOTALL)
>self.citationlist = re.findall(comppat, allinput)
>
>There are three matches to make in this body of text. The above code
>has been failing on the second of the three. This problem has only
>been occurring on Linux Python and not Windows Python (the stack in
>Windows is just large enough to accommodate the matches).
>Text to match:
>
>http://160.129.203.97/1998_xmltest.html
>
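Here's a slightly different approach you could try: split on the tags themselves,
so the regex only ever has to match a short tag and never the long text between them: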
>>> import re
>>> import urllib
>>> allinput = urllib.urlopen('http://160.129.203.97/1998_xmltest.html').read()
>>> len(allinput)
29714
>>> pattern=re.compile('(</?PubMedArticle>)',re.DOTALL)
>>> allsplit = pattern.split(allinput)
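Because the pattern has a capturing group, split() keeps the matched tags in the result,
so the list runs text, open tag, article text, close tag, text, ... and the article
bodies sit at indices 2, 6, 10, and so on. In the following, allsplit[i] is the (.*?) text
you wanted, I think, but it's a bit long, so I just printed the first and last 80 chars
of each, bracketed with <wyw> <...> </wyw> ([w]hat [y]ou [w]ant ;-):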
>>> for i in range(2,len(allsplit),4): print '<wyw>%s\n<...>\n%s</wyw>\n' % (
... allsplit[i][:80],allsplit[i][-80:])
...
<wyw>
<MedlineCitation Status="Completed">
<MedlineID>99071918&
<...>
quot;>99071918</ArticleId>
</ArticleIdList>
</PubmedData>
</wyw>
<wyw>
<MedlineCitation Status="Completed">
<MedlineID>99071917&
<...>
quot;>99071917</ArticleId>
</ArticleIdList>
</PubmedData>
</wyw>
<wyw>
<MedlineCitation Status="Completed">
<MedlineID>99071916&
<...>
quot;>99071916</ArticleId>
</ArticleIdList>
</PubmedData>
</wyw>
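To make the index arithmetic easier to see, here is a minimal sketch of the same split on a
tiny made-up string. The sample text and the names sample, tag_pattern, parts and articles
are only for illustration and are not taken from the PubMed page; it is written so that it
should also run on a current Python.

```python
import re

# Hypothetical miniature input standing in for the real page.
sample = ("header"
          "<PubMedArticle>first</PubMedArticle>\n"
          "<PubMedArticle>second</PubMedArticle>"
          "trailer")

# The capturing group makes split() keep the tags in the result list.
tag_pattern = re.compile(r'(</?PubMedArticle>)')
parts = tag_pattern.split(sample)

# Layout: [text, open tag, body, close tag, text, open tag, body, ...],
# so the article bodies are every fourth item starting at index 2.
articles = parts[2::4]
print(articles)
```

The slice parts[2::4] picks the same items as the range(2, len(allsplit), 4) loop above.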
Of course, this depends on there being no missing tags for <PubMedArticle> .. </PubMedArticle>
and no alternative forms of those tags.
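If you want to guard against that assumption, one possible sketch is to check that the
captured tags strictly alternate between open and close before trusting the slice. The
function name extract_articles and the choice to raise ValueError are my own assumptions
here, not something the original code does.

```python
import re

def extract_articles(text):
    """Return the text between each <PubMedArticle> ... </PubMedArticle> pair.

    Raises ValueError if the tags do not strictly alternate open/close
    (hypothetical helper, for illustration only).
    """
    parts = re.split(r'(</?PubMedArticle>)', text)
    # With a capturing group, the matched tags land at the odd indices.
    tags = parts[1::2]
    expected = ['<PubMedArticle>', '</PubMedArticle>'] * (len(tags) // 2)
    if len(tags) % 2 or tags != expected:
        raise ValueError('missing or nested PubMedArticle tags')
    # Article bodies follow each open tag: indices 2, 6, 10, ...
    return parts[2::4]
```

This still does nothing about alternative forms of the tags (attributes, different case),
so it only catches the missing-tag case.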
Regards,
Bengt Richter