Freeze problem with Regular Expression

Maric Michaud maric at aristote.info
Wed Jun 25 17:31:00 EDT 2008


On Wednesday 25 June 2008 18:40:08, cirfu wrote:
> On 25 June, 17:20, Kirk <nore... at yahoo.com> wrote:
> > Hi All,
> > the following regular expression match seems to enter an infinite
> > loop:
> >
> > ################
> > import re
> > text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
> > re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
> > #################
> >
> > No problem with perl with the same expression:
> >
> > #################
> > $s = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una ';
> > $s =~ /[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$/;
> > print $1;
> > #################
> >
> > I've python 2.5.2 on Ubuntu 8.04.
> > any idea?
> > Thanks!
> >
> > --
> > Kirk
>
> what are you trying to do?

That is indeed the right question.

Whatever the implementation or language, an expression like that may happen 
to work, but I doubt you'll find anyone able to tell you whether it *should* 
work or not; my brain-embedded parser goes into an infinite loop on it too...

That said, "[0-9|a-z|\-]" is strange in itself: a pipe (|) between square 
brackets is just the literal character '|', not alternation, so there is no 
reason for it to appear there twice (or, most likely, at all).
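
A quick way to check this for yourself is the small sketch below. The sample 
string and variable names are made up for the illustration; only the 
character class comes from the original expression.

```python
import re

# Made-up sample containing a lowercase letter, a pipe, a digit,
# a hyphen and an uppercase letter.
sample = 'a|1-Z'

# Character class as written in the original expression: inside [...]
# the pipe is an ordinary character, so '|' itself is matched.
with_pipes = re.findall(r'[0-9|a-z|\-]', sample)

# Same class without the pipes: digits, lowercase letters and '-' only.
without_pipes = re.findall(r'[0-9a-z-]', sample)

print(with_pipes)
print(without_pipes)
```

If a literal '|' is not wanted in the input, the second form is probably 
what was meant.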

Very complicated regexps are always evil, and filtering in two or three 
stages is likely to do the job with good, or at least better, readability.
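
To give an idea of what staged filtering could look like, here is a rough 
sketch. It *assumes* the goal is to pull out runs of consecutive capitalized 
tokens (such as company names), which is only my guess; the function name 
uppercase_runs, the TOKEN_RE pattern and the set of stripped punctuation are 
invented for the illustration.

```python
import re

# Assumed shape of one "interesting" token: optional digits, one or more
# uppercase letters, then optional digits, lowercase letters or hyphens.
TOKEN_RE = re.compile(r'[0-9]*[A-Z]+[0-9a-z-]*$')

def uppercase_runs(text):
    """Return runs of consecutive tokens that match TOKEN_RE."""
    runs, current = [], []
    for raw in text.split():             # stage 1: split on whitespace
        token = raw.strip('().,;:')      # stage 2: drop surrounding punctuation
        if TOKEN_RE.match(token):        # stage 3: simple per-token test
            current.append(token)
        elif current:                    # a non-matching token ends the run
            runs.append(' '.join(current))
            current = []
    if current:                          # flush a run that reaches the end
        runs.append(' '.join(current))
    return runs

text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
print(uppercase_runs(text))
```

Each stage is trivial on its own, and the only regexp left is applied to one 
short token at a time, with no nested repetition in it.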

But once more, what are you trying to do? It is not even clear that regexp 
matching is the best tool for it.

-- 
_____________

Maric Michaud


