RE - non-greedy - some greediness - complete greediness

Sun Nov 10 08:36:51 EST 2002

On Sun, 10 Nov 2002 09:57:28 +0100 (MET), Doru-Catalin Togea <doru-cat at ifi.uio.no> wrote:

>Hi!
>
>I am working on a little project where i need REs with a self-defined
>level of greediness.
>
>I am processing text in a Latex like manner, for those of you who are
>familiar with it.
>
>I want to be able to recognize Latex-like commands within my text. A
>command starts with a character, say '\' followed by the command's name
>and followed by the text on which the command applies enclosed in curly
>brackets. Examples:
>
>	\footnote{some text}
>	\cite{some referance}
>	\section{some title}
>
>Recognizing such patterns and retriving the name of the command and the
>text on which the command applies is not so difficult to achieve. Things
>get complicated though when one encounters nested commands.
>
This comes up with some regularity ;-)

>Say I have the following string:
>
> "abcd \footnote{efgh \cite{myRef1} ijkl} mnop \footnote{hello there}"
>                                  ^     ^                           ^
> closing bracket of nested \cite  |     |                           |
>    closing bracket of first \footnote  |                           |
>                               closing bracket of second \footnote  |
>
>By using non-greedy REs, I would get recognized the following footnotes:
>	1) \footnote{efgh \cite{myRef1}
>	2) \footnote{hello there}
>
>The first matching is wrong because the first '}' encountered belongs to
>the nested \cite command. The second matching is correct.
>
>By using greedy REs, I would get recognized the following pattern:
>	1) \footnote{efgh \cite{myRef1} ijkl} mnop \footnote{hello there}
>
>This is wrong because a) there are two \footnote commands in my string, so
>I should get two matchings, and b) the first \footnote command should be
>applied only to the "efgh \cite{myRef1} ijkl" substring.
>
>In other words I need to be able to specify the level of greediness my REs
>should have. Is it possible? If yes, how?
>
>I have an idea of how to solve my problem without REs, by counting the
>opening and closing curly brackets and reasoning algorithmically about
>where within my text each nested command begins and ends. The question is
>can the above stated problem be solved elegantly by means of REs?
>
Since re's can't count without external help, and tricking it into using
external help is not very efficient, it is probably best either to do it
all without re or just use re to split the input into pieces that are easier
for your subsequent processing to identify and deal with. E.g. (I added 
escaped backslash and curly brackets without knowing if/how that's actually done
in Latex, and also assumed commands are lower case [a-z]+ Adjust to suit):

 >>> import re
 >>> s = r"abcd \footnote{efgh \cite{myRef1} ijkl} mnop \\not-cmd \{no curl\} \footnote{hello there}"
 >>> rx = re.compile(r'(\\[\\{}]|\\[a-z]+|[{}])')
 >>> ss = rx.split(s)
 >>> ss
 ['abcd ', '\\footnote', '', '{', 'efgh ', '\\cite', '', '{', 'myRef1', '}', ' ijkl', '}', ' mnop ',
  '\\\\', 'not-cmd ', '\\{', 'no curl', '\\}', ' ', '\\footnote', '', '{', 'hello there', '}', '']
 >>> for x in ss: print `x`
 ...
 'abcd '
 '\\footnote'
 ''
 '{'
 'efgh '
 '\\cite'
 ''
 '{'
 'myRef1'
 '}'
 ' ijkl'
 '}'
 ' mnop '
 '\\\\'
 'not-cmd '
 '\\{'
 'no curl'
 '\\}'
 ' '
 '\\footnote'
 ''
 '{'
 'hello there'
 '}'
 ''

Thus each element in the list will be r'\\' or r'\{' or r'\}' or r'\command' or '{' or '}' or plain text,
which are easy to tell apart. You could then process the list recursively or in a loop. BTW, you
can get rid of the null strings with filter, e.g.,
   ss = filter(None, rx.split(s))
if you don't want to use the fact that they serve to put matches with the split expression at
the odd indices in the list and plain text (which will be null if matches are adjacent) at the
even indices. I.e., note:

 >>> [ss[i] for i in range(1,len(ss),2)]
 ['\\footnote', '{', '\\cite', '{', '}', '}', '\\\\', '\\{', '\\}', '\\footnote', '{', '}']

and

 >>> [ss[i] for i in range(0,len(ss),2)]
 ['abcd ', '', 'efgh ', '', 'myRef1', ' ijkl', ' mnop ', 'not-cmd ', 'no curl', ' ', '', 'hello there', '']

My question would be how outer functions are applied to the results of inner ones. Are some inner
results marked for passing through the outer and others marked for modification, and then all finally
put together?

Regards,
Bengt Richter