[Python-bugs-list] [Bug #110664] PRIVATE: Bug in re module (PR#36)

Fri, 18 Aug 2000 13:37:26 -0700

Bug #110664, was updated on 2000-Jul-31 21:11
Here is a current snapshot of the bug.

Project: Python
Category: Modules
Status: Closed
Resolution: Wont Fix
Bug Group: None
Priority: 5
Summary: PRIVATE: Bug in re module (PR#36)

Details: Jitterbug-Id: 36
Submitted-By: chenna@embl-heidelberg.de
Date: Fri, 23 Jul 1999 06:51:43 -0400 (EDT)
Version: 1.5.1
OS: OSF1

Hello I get 

Stack overflow: pid 14731, proc emparse1.py, addr 0x11fdfffd8, pc 0x120068cd8
Segmentation fault

when I run the following. The problem is that the re module is unable
to search for a pattern that is too large, but this is a requirement
for my application in biology.  I enclose the source code with this.

Please email me the solution as soon as possible

Thanks
Ramu

___________________________

#!/usr/pub/bin/python1.5
#
#
#
#   (C)  Chenna Ramu, EMBL, Heidelberg, Germany
#
#         parser for biological databases
#

import string
import sys
import re

parserDict = {
    'id' : r'((^ID   [^\n]+\n)+)' ,
    'ac' : r'((^AC   [^\n]+\n)+)' ,
    'dt' : r'((^DT   [^\n]+\n)+)' ,
    'de' : r'((^DE   [^\n]+\n)+)' ,
    'gn' : r'((^GN   [^\n]+\n)+)' ,
    'os' : r'((^OS   [^\n]+\n)+)' ,
    'oc' : r'((^OC   [^\n]+\n)+)' ,

    'ref' : r'(('
              r'(^RN   [^\n]+\n)'
              r'((^RP   [^\n]+\n)+)' 
              r'((^RX   [^\n]+\n)?)' 
              r'((^RA   [^\n]+\n)+)' 
              r'((^RT   [^\n]+\n)*)' 
              r'((^RL   [^\n]+\n)+)' 
            r')+)',

    'cc' : r'((^CC   [^\n]+\n)+)' ,
    'dr' : r'((^DR   [^\n]+\n)+)' ,
    'kw' : r'((^KW   [^\n]+\n)+)' ,
    'ft' : r'((^FT   [^\n]+\n)+)' ,
    'sq' : r'(^SQ  [^\n]+\n)' \
           r'((^    [^\n]+\n)+)'
    }

_hrefLink = {'embl':['<A href=%s>%s</A>','^DR   ([^;]+)']} #should be like this

hrefLink = {'EMBL':"<A href=http://www/wgetz?-e+[%s-id:%s]>%s</a>",
            'MIM':"<A href=http://www/wgetz?-e+[%s-id:%s]>%s</a>"}

em_rep = r'(^DR   )(?P<dbase>[^;]+); (?P<id>[^;]+)'

class embl:    
    def __init__(self,parserDict={}):
        self.parserDict = {}
        self.reDict = {}  #keep the compiled re's

        self.fields = []
        if parserDict:
            self.Init(parserDict)

    def Init(self,parserDict):
        self.parserDict = parserDict
        self.fields = parserDict.keys()
        for j in self.fields:
            setattr(self, j, None)
            self.reDict[j] = re.compile(parserDict[j],re.MULTILINE)

    def Parse(self,str):
        if not self.parserDict:
            print "No parser specified"
            return
        for k,v in self.parserDict.items():
##          tmp = re.compile(v,re.MULTILINE)  # move this to __init__
            tmp = self.reDict[k]
            mat = tmp.search(str)
            if mat:
                setattr(self, k, mat.group() )

    def Field(self,name):
        try:
            return getattr(self,name)
        except AttributeError:
            return None

    def PrintFields(self):
        flds = self.fields
        for j in flds:
            print "==> ",j
            print getattr(self,j)

    def ReParse(self,str,retToken,pat):
        """ str:string to parse,  retToken:return token, pat:parser """
        _p = re.compile(pat)
        m = _p.search(str)
        if m:
            return m.group(retToken)
        else:
            return None

def Href(match):
    """  Replaces the hrefs """
    dbase = match.group('dbase')
    id = match.group('id')
    try:
        defi = hrefLink[dbase]
    except KeyError:
        defi = None
    if defi:
        tmp = match.group(1) + dbase + '; '+defi %(dbase,id,id)
    else:
        tmp = match.group(1)+ dbase + '; ' + id 
    return tmp

def test(fileName):
    sys.path.insert(0,'/home/chenna/py')
##  from seqFormat import *

    e = embl(parserDict)
#    f = open('acha_mouse.dat','r')
    f = open(fileName,'r')
    a = f.readlines()
    f.close()
    a = string.join(a,'')
    e.Parse(a)
    e.PrintFields()

    print ' the fields are ',e.fields
##    seq = string.split(e.sq,';')[-1]
##    s = Seq(seq,'test')
##    print 'check===>',s.seq
##    s.SeqPrint('swiss')

    seqLen = e.ReParse(e.sq[0:50],'seqLen',r'^SQ   [^ ]+ *(?P<seqLen>(\d+))')

    print e.sq
    print "length of the sequence ",seqLen
    print e.ref

    dr = e.dr
    print dr

    p = re.compile(em_rep,re.M)
    dr = p.sub(Href,dr)
    print dr
    print e.Field('id')
    print e.Field('dr')
    print e.Field('mm')

    return 

def  test1(dumm=None):
    tmp = ['SQ   Sequence 1041931 BP; 8972 A; 5950 C; 6264 G; 8224 T; 0 other;\n']
    for j in range(1,17365):
        t = '     ' + 'tcagtcagtg ' * 6 + '\n'
        tmp.append(t)
    a = string.join(tmp)
    e = embl(parserDict)
    e.Parse(a)
    e.PrintFields()

if __name__ == '__main__':
    try:
        test1(sys.argv[1])
    except:
        test1()

====================================================================
Audit trail:
Sat Jul 24 16:53:53 1999	guido	changed notes
Sat Jul 24 16:53:53 1999	guido	moved from incoming to open
Sat Jul 24 17:04:17 1999	guido	changed notes

Follow-Ups:

Date: 2000-Aug-06 20:53
By: twouters

Comment:
Fair chance this is already fixed, either in the 1.5.2 re module or in the new 'sre' by Fredrik. Maybe /F can say something useful here.

-------------------------------------------------------

Date: 2000-Aug-07 12:07
By: effbot

Comment:
It works fine under SRE, but bombs under PRE using a recent 2.0 build (tested on Windows).
-------------------------------------------------------

Date: 2000-Aug-18 20:37
By: akuchling

Comment:
Boiled down test case:

import string
import pre as re
pat = re.compile('(^    [^\n]+\n)+', re.MULTILINE)
str = 17365 * ( '     ' + 'tcagtcagtg ' * 6 + '\n' )
pat.search( str )

The problem stems from nesting [^\n]+ inside a group that is then
repeated.  PCRE does a C recursion after matching [^\n]+, so on a
long string the engine ends up nesting very deeply and filling the
stack.  With my knowledge of the engine, I see no way to rewrite the
pattern to avoid this, so the program would have to be restructured
to do its work differently.

Not easily fixable without tearing PCRE apart; since SRE seems 
to survive this pattern, let's just count our blessings. :)
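
One possible restructuring, as a rough sketch only and not the
submitter's code: scan the input line by line and collect the first
run of consecutive lines that begin with a given prefix, so no
repeated group is needed and the work does not depend on how deeply
the regex engine recurses.  The names iter_lines and first_run, the
shortened SQ header, and the "//" terminator line are illustrative
assumptions, and the sketch is written for a current Python rather
than the 1.5/2.0 interpreters discussed here.

```python
def iter_lines(text):
    # Yield each newline-terminated line of text, newline included.
    start = 0
    while True:
        end = text.find("\n", start)
        if end == -1:
            return  # a trailing fragment without a newline cannot match
        yield text[start:end + 1]
        start = end + 1


def first_run(text, prefix):
    # Return the first run of consecutive lines that look like
    # "<prefix><one or more characters><newline>", or None if there is none.
    run = []
    for line in iter_lines(text):
        if line.startswith(prefix) and len(line) > len(prefix) + 1:
            run.append(line)
        elif run:
            break  # the run has ended
    if run:
        return "".join(run)
    return None


# Same shape of input as the boiled-down test case (hypothetical header/footer).
line = "     " + "tcagtcagtg " * 6 + "\n"
body = line * 17365
record = "SQ   Sequence 1041931 BP;\n" + body + "//\n"
sequence = first_run(record, "    ")
```

The loop keeps only a flat list of lines, so memory use grows with the
size of the run but the call depth stays constant.  The other fields
in the script could be collected the same way with their own prefixes,
such as 'DR   ', though the nested 'ref' block would need extra
handling.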

-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=110664&group_id=5470