Help with regular expression using findall and .*?

Harvey Thomas hst at empolis.co.uk
Fri Sep 13 12:47:11 EDT 2002


czrpb wrote
> 
> Harvey:
> 
> Great thanks!! And thanks for sticking to my question's 
> requirements. <wink!>
> 
> Ok, this is what we thought around here. But what I do not 
> understand is why any backtracking data is being kept? The 
> '?' in '.*?' means it is non-greedy right? When would 
> backtracking ever occur using '.*?'? What am I missing?
> 
> <<q
> 
> 
> On Fri, 13 Sep 2002, Harvey Thomas wrote:
> 
> > czrpb wrote
> > > 
> > > Could anyone help out with rewriting (still using regular 
> expressions)
> > > the following so that it does not cause an exception:
> > > 
> > > import re
> > > 
> > > s1=('macro\n'+'a'*200+'\norcam\n')*10
> > > s2=('macro\n'+'a'*20000+'\norcam\n')*10
> > > 
> > > p=re.compile(r'macro.*?orcam',re.DOTALL)
> > > 
> > > for x in re.findall(p,s1):
> > >     print x
> > > 
> > > for x in re.findall(p,s2):
> > >     print x
> > > 
> > > thanks!! Quentin Crain
> > > 
> > 
> > You need to be very careful about using .*? as the engine 
> "only" allows 10,000 backtracks
> > 
> > Try this
> > 
> > p = re.compile('macro(?:[^o]+|o(?!rcam))*orcam')
> > for x in p.findall(s2):
> >     print x
> > 
> > HTH
> > 
> > Harvey
> > 

Don't think I, or even Mr. Friedl can explain this concisely. If you are interested in knowing about the mechanics of REs, the best reference book is the 2nd edition of "Mastering Regular Expressions" by Jeffrey Friedl ISBN 0-596-00289-0.

A good tip though is to try and use negated character classes whenever possible rather than use "."

Harvey

_____________________________________________________________________
This message has been checked for all known viruses by the MessageLabs Virus Scanning Service.




More information about the Python-list mailing list