[Python-Dev] Regular expressions, Unicode etc.

Nick Maclaren nmm1 at cus.cam.ac.uk
Wed Aug 8 21:47:11 CEST 2007


I am not on "Python 3000", so am restricting.

Mike Klaas <mike.klaas at gmail.com> wrote:
>
> > I have needed to push my stack to teach REs (don't ask), and am
> > taking a look at the RE code.  I may be able to extend it to support
> > RFE 694374 and (more importantly) atomic groups and possessive
> > quantifiers.  While I regard such things as revolting beyond belief,
> > they make a HELL of a difference to the efficiency of recognising
> > things like HTML tags in a morass of mixed text.
> 
> +1.  I would use such a feature.

I think that I am getting somewhere, but I really dislike the style
of _sre.c.  It has a very complex semi-stack, semi-finite-state
design and no comments on how it is supposed to work.

And its memory management looks like a recipe for leaks, so I may
well introduce some of them.

> > The other approach, which is to stick to true regular expressions,
> > and wholly or partially convert to DFAs, has already been rendered
> > impossible by even the limited Perl/PCRE extensions that Python
> > has adopted.
> 
> Impossible?  Surely, a sufficiently-competent re engine could detect  
> when a DFA is possible to construct?

I doubt it.  While it isn't equivalent to the halting problem, it IS
an intractable one!  There are two problems:

Firstly, things like backreferences are an absolute no-no.  They
are not regular, and REs with them in cannot be converted to DFAs.
That could be 'solved' by a parser that kicked out such constructions,
but it would get screams from many users.

Secondly, anything involving explicit or implicit negation can lead
to (if I recall) a super-exponential explosion in the size of the
DFA.  That could be 'solved' by imposing a limit, but few people
would be able to predict when it would bite.

Thirdly, I would require notice of the question of whether capturing
parentheses could be supported, and what semantic changes would be
to which were set and how.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


More information about the Python-Dev mailing list