[Python-bugs-list] [Bug #110664] PRIVATE: Bug in re module (PR#36)
noreply@sourceforge.net
Fri, 18 Aug 2000 13:37:26 -0700
Bug #110664, was updated on 2000-Jul-31 21:11
Here is a current snapshot of the bug.
Project: Python
Category: Modules
Status: Closed
Resolution: Wont Fix
Bug Group: None
Priority: 5
Summary: PRIVATE: Bug in re module (PR#36)
Details: Jitterbug-Id: 36
Submitted-By: chenna@embl-heidelberg.de
Date: Fri, 23 Jul 1999 06:51:43 -0400 (EDT)
Version: 1.5.1
OS: OSF1
Hello I get
Stack overflow: pid 14731, proc emparse1.py, addr 0x11fdfffd8, pc 0x120068cd8
Segmentation fault
when I run the following. The problem is that the re module is unable
to search for a pattern that is too large, but this is a requirement
for my application in biology. I enclose the source code with this.
Please email me the solution as soon as possible.
Thanks
Ramu
___________________________
#!/usr/pub/bin/python1.5
#
#
#
# (C) Chenna Ramu, EMBL, Heidelberg, Germany
#
# parser for biological databases
#
import string
import sys
import re
parserDict = {
'id' : r'((^ID [^\n]+\n)+)' ,
'ac' : r'((^AC [^\n]+\n)+)' ,
'dt' : r'((^DT [^\n]+\n)+)' ,
'de' : r'((^DE [^\n]+\n)+)' ,
'gn' : r'((^GN [^\n]+\n)+)' ,
'os' : r'((^OS [^\n]+\n)+)' ,
'oc' : r'((^OC [^\n]+\n)+)' ,
'ref' : r'(('
r'(^RN [^\n]+\n)'
r'((^RP [^\n]+\n)+)'
r'((^RX [^\n]+\n)?)'
r'((^RA [^\n]+\n)+)'
r'((^RT [^\n]+\n)*)'
r'((^RL [^\n]+\n)+)'
r')+)',
'cc' : r'((^CC [^\n]+\n)+)' ,
'dr' : r'((^DR [^\n]+\n)+)' ,
'kw' : r'((^KW [^\n]+\n)+)' ,
'ft' : r'((^FT [^\n]+\n)+)' ,
'sq' : r'(^SQ [^\n]+\n)' \
r'((^ [^\n]+\n)+)'
}
_hrefLink = {'embl':['<A href=%s>%s</A>','^DR ([^;]+)']} #should be like this
hrefLink = {'EMBL':"<A href=http://www/wgetz?-e+[%s-id:%s]>%s</a>",
'MIM':"<A href=http://www/wgetz?-e+[%s-id:%s]>%s</a>"}
em_rep = r'(^DR )(?P<dbase>[^;]+); (?P<id>[^;]+)'
class embl:
    def __init__(self,parserDict={}):
        self.parserDict = {}
        self.reDict = {} #keep the compiled re's
        self.fields = []
        if parserDict:
            self.Init(parserDict)
    def Init(self,parserDict):
        self.parserDict = parserDict
        self.fields = parserDict.keys()
        for j in self.fields:
            setattr(self, j, None)
            self.reDict[j] = re.compile(parserDict[j],re.MULTILINE)
    def Parse(self,str):
        if not self.parserDict:
            print "No parser specified"
            return
        for k,v in parserDict.items():
            ## tmp = re.compile(v,re.MULTILINE) # move this to __init__
            tmp = self.reDict[k]
            mat = tmp.search(str)
            if mat:
                setattr(self, k, mat.group() )
    def Field(self,name):
        try:
            return getattr(self,name)
        except AttributeError:
            return None
    def PrintFields(self):
        flds = self.fields
        for j in flds:
            print "==> ",j
            print getattr(self,j)
    def ReParse(self,str,retToken,pat):
        """ str:string to parse, retToken:return token, pat:parser """
        _p = re.compile(pat)
        m = _p.search(str)
        if m:
            return m.group(retToken)
        else:
            return None
def Href(match):
    """ Replaces the hrefs """
    dbase = match.group('dbase')
    id = match.group('id')
    try:
        defi = hrefLink[dbase]
    except KeyError:
        defi = None
    if defi:
        tmp = match.group(1) + dbase + '; '+defi %(dbase,id,id)
    else:
        tmp = match.group(1)+ dbase + '; ' + id
    return tmp
def test(fileName):
    sys.path.insert(0,'/home/chenna/py')
    ## from seqFormat import *
    e = embl(parserDict)
    # f = open('acha_mouse.dat','r')
    f = open(fileName,'r')
    a = f.readlines()
    f.close()
    a = string.join(a,'')
    e.Parse(a)
    e.PrintFields()
    import string
    print ' the fields are ',e.fields
    ## seq = string.split(e.sq,';')[-1]
    ## s = Seq(seq,'test')
    ## print 'check===>',s.seq
    ## s.SeqPrint('swiss')
    seqLen = e.ReParse(e.sq[0:50],'seqLen',r'^SQ [^ ]+ *(?P<seqLen>(\d+))')
    print e.sq
    print "length of the sequence ",seqLen
    print e.ref
    dr = e.dr
    print dr
    p = re.compile(em_rep,re.M)
    dr = p.sub(Href,dr)
    print dr
    print e.Field('id')
    print e.Field('dr')
    print e.Field('mm')
    return
def test1(dumm=None):
    tmp = ['SQ Sequence 1041931 BP; 8972 A; 5950 C; 6264 G; 8224 T; 0 other;\n']
    for j in range(1,17365):
        t = ' ' + 'tcagtcagtg ' * 6 + '\n'
        tmp.append(t)
    a = string.join(tmp)
    e = embl(parserDict)
    e.Parse(a)
    e.PrintFields()
if __name__ == '__main__':
    try:
        test1(sys.argv[1])
    except:
        test1()
====================================================================
Audit trail:
Sat Jul 24 16:53:53 1999 guido changed notes
Sat Jul 24 16:53:53 1999 guido moved from incoming to open
Sat Jul 24 17:04:17 1999 guido changed notes
Follow-Ups:
Date: 2000-Aug-06 20:53
By: twouters
Comment:
Fair chance this is already fixed, either in the 1.5.2 re module, or in the new 'sre' by Fredrik. Maybe /F can say something useful here.
-------------------------------------------------------
Date: 2000-Aug-07 12:07
By: effbot
Comment:
It works fine under SRE, but bombs under PRE using a recent 2.0 build (tested on Windows).
-------------------------------------------------------
Date: 2000-Aug-18 20:37
By: akuchling
Comment:
Boiled down test case:
import string
import pre as re
pat = re.compile('(^ [^\n]+\n)+', re.MULTILINE)
str = 17365 * ( ' ' + 'tcagtcagtg ' * 6 + '\n' )
pat.search( str )
The problem stems from nesting [^\n]+ inside a group that is then
repeated. PCRE does a C recursion after matching [^\n]+, so on a
long string the engine ends up nesting very deeply and filling the
stack. With my knowledge of the engine, I see no way to rewrite the
pattern to avoid this, so the program would have to be restructured
to do its work differently.
Not easily fixable without tearing PCRE apart; since SRE seems
to survive this pattern, let's just count our blessings. :)
-------------------------------------------------------
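One way to restructure the program as suggested in the last comment is to walk the record one line at a time, so that no single regular expression has to match across thousands of lines. The following is only a minimal sketch in modern Python 3, not code from the original script: the names split_fields and LINE_CODE are hypothetical, and it assumes that every record line starts with a two-letter code followed by a space and that sequence data lines start with a space after the SQ header. It keeps the first run of consecutive lines for each code, roughly what search() returned for the simple patterns. It does not check the RN/RP/RX/RA/RT/RL ordering that the 'ref' pattern enforces; those lines simply end up under their own codes.

```python
import re

# Hypothetical pattern: a two-letter line code followed by a space.
LINE_CODE = re.compile(r'^([A-Z]{2}) ')

def split_fields(text):
    """Map each lower-cased line code to its first run of consecutive lines."""
    fields = {}        # code -> list of lines in the first run for that code
    finished = set()   # codes whose first run has already ended
    current = None     # code of the run currently being read, or None
    for line in text.splitlines(keepends=True):
        m = LINE_CODE.match(line)
        if m:
            code = m.group(1).lower()
        elif current == 'sq' and line.startswith(' '):
            code = 'sq'    # sequence data lines continue the SQ block
        else:
            code = None    # blank line, terminator, or unknown layout
        if code != current:
            if current is not None:
                finished.add(current)
            current = code
        if code is not None and code not in finished:
            fields.setdefault(code, []).append(line)
    return {code: ''.join(lines) for code, lines in fields.items()}

if __name__ == '__main__':
    # Same shape of input as the boiled-down test case: one header line
    # followed by many sequence lines.
    data_line = ' ' + 'tcagtcagtg ' * 6 + '\n'
    record = 'ID TEST\n' + 'SQ Sequence header\n' + data_line * 17364
    fields = split_fields(record)
    print(sorted(fields), len(fields['sq']))
```

Because the regular expression only ever sees a single line, the depth of work per match stays constant regardless of how long the sequence block is.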
For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=110664&group_id=5470