How to get the "longest possible" match with Python's RE module?

Licheng Fang fanglicheng at gmail.com
Tue Sep 12 00:51:03 EDT 2006


MonkeeSage wrote:
> Licheng Fang wrote:
> > Basically, the problem is this:
> >
> > >>> p = re.compile("do|dolittle")
> > >>> p.match("dolittle").group()
> > 'do'
>
> >From what I understand, this isn't python specific, it is the expected
> behavior of that pattern in any implementation. You are using
> alternation, which means "either, or", and you have the shorter
> subexpression first, so the condition is satisfied by just 'do' and the
> matching terminates.
>
> > There's another example:
> >
> > >>> p = re.compile("one(self)?(selfsufficient)?")
> > >>> p.match("oneselfsufficient").group()
> > 'oneself'
>
> Again, I don't think this has anything to do with python. You pattern
> basically means "match 'one' whether it is followed by 'self' or not,
> and whether it is followed by 'selfsufficient' or not". For this
> particular example, you'd want something like
> "one(self)?(sufficient)?".
>
> I think you could construct a pattern that would do what you want in
> python without any problem. If you post a (short) example of your data,
> I'm sure someone could help you with it.
>
> Regards,
> Jordan

Hi, according to these regexp engine discussions, it's NOT a behavior
true to any implementation.

http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html

Python's NFA engine reads along the input string, matching it to the
pattern, and backtracking when needed. By contrast a DFA engine, to my
understanding, constructs a DFA and uses it to munch as many characters
as possible. Maybe it's like this:

Pattern: one(self)?(selfsufficient)?

PYTHON'S NFA ENGINE:

         one         self, none        selfsufficient, none
(start)------->((1))------------>((2))----------------------->((3))

DFA ENGINE:

         one             self
(start)------->((123))------------>((23))
                  |
                  |
                  |  selfsufficient
                   --------------->((3))

I want to know if there is some way to make Python RE behave like grep
does, or do I have to change to another engine?




More information about the Python-list mailing list