shortest match regexp operator anyone?

Harald Kirsch kirschh at lionbioscience.com
Wed Jul 11 13:50:33 EDT 2001


SHORT STORY:
Does anyone know of a regular expression library which has an operator
that forces a subexpression to insist on its shortest match, even if
that ruins the overall match?

LONG STORY:

Look how different the following tasks are, although they look so
similar. 

1) TASK: Find the first 'A' and match, if it is followed by a 'B'?
   SOLUTION: '^[^A]+AB'

2) TASK: Find the first '<A>' and match, if it is followed by a 'B'
   SOLUTION: ???

An approximation for (2) is '^[^<>A]+<A>B', but it does not match
'A<A>B', which it should. 

With non-greedy matching, another approximation is '^.*?<A>B', however
this matches 'xx<A>y<A>B', although it should not.

The third approximation is '^(.*<A>){1,1}?B' and the Tcl's manual page
`re_syntax' seems to indicate that '{1,1}?' forces a subexpression on
a shortest match. However the `?' indicates `non-greedy' which only
means that the `shortest match not ruining an overall match' is taken
and not the `shortest match!'

The solution would in fact be a *shortest match operator*. For the
sake of discussion let it have postfix syntax like `*' and be written
as `%'. The solution for (2) would be '^(.*<A>)%B' which would *not*
match the string 'xx<A>y<A>B' since the shortest '.*<A>' found is not
followed by 'B'.

Again the question: Is there a re-library with a true shortest match
operator?

REMARK: The implementation would be rather simple since it suffices to
delete all outgoing transitions from stop states of the automaton
equivalent to the subexpression.

  Harald Kirsch
-- 
----------------+------------------------------------------------------
Harald Kirsch   | kirschh at lionbioscience.com | "How old is the epsilon?"
LION bioscience | +49 6221 4038 172          |        -- Paul Erdös
       *** Please do not send me copies of your posts. ***



More information about the Python-list mailing list