shortest match regexp operator anyone?
Harald Kirsch
kirschh at lionbioscience.com
Wed Jul 11 13:50:33 EDT 2001
SHORT STORY:
Does anyone know of a regular expression library which has an operator
that forces a subexpression to insist on its shortest match, even if
that ruins the overall match?
LONG STORY:
Look how different the following tasks are, although they look so
similar.
1) TASK: Find the first 'A' and match, if it is followed by a 'B'?
SOLUTION: '^[^A]+AB'
2) TASK: Find the first '<A>' and match, if it is followed by a 'B'
SOLUTION: ???
An approximation for (2) is '^[^<>A]+<A>B', but it does not match
'A<A>B', which it should.
With non-greedy matching, another approximation is '^.*?<A>B', however
this matches 'xx<A>y<A>B', although it should not.
The third approximation is '^(.*<A>){1,1}?B' and the Tcl's manual page
`re_syntax' seems to indicate that '{1,1}?' forces a subexpression on
a shortest match. However the `?' indicates `non-greedy' which only
means that the `shortest match not ruining an overall match' is taken
and not the `shortest match!'
The solution would in fact be a *shortest match operator*. For the
sake of discussion let it have postfix syntax like `*' and be written
as `%'. The solution for (2) would be '^(.*<A>)%B' which would *not*
match the string 'xx<A>y<A>B' since the shortest '.*<A>' found is not
followed by 'B'.
Again the question: Is there a re-library with a true shortest match
operator?
REMARK: The implementation would be rather simple since it suffices to
delete all outgoing transitions from stop states of the automaton
equivalent to the subexpression.
Harald Kirsch
--
----------------+------------------------------------------------------
Harald Kirsch | kirschh at lionbioscience.com | "How old is the epsilon?"
LION bioscience | +49 6221 4038 172 | -- Paul Erdös
*** Please do not send me copies of your posts. ***
More information about the Python-list
mailing list