[Tutor] Question regular expressions - the non-greedy pattern

Lie Ryan lie.1296 at gmail.com
Tue Jan 22 01:44:31 CET 2013


On 22/01/13 10:11, Marcin Mleczko wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello Hugo, hello Walter,
>
> first thank you very much for the quick reply.
>
> The functions used here i.e. re.match() are taken directly form the
> example in the mentioned HowTo. I'd rather use re.findall() but I
> think the general interpretetion of the given regexp sould be nearly
> the same in both functions.
>
> So I'd like to neglect the choise of a particular function for a
> moment a concentrate on the pure theory.
> What I got so far:
> in theory form s = '<<html><head><title>Title</title>'
> '<.*?>' would match '<html>' '<head>' '<title>' '</title>'
> to achieve this the engine should:
> 1. walk forward along the text until it finds <
> 2. walk forward from that point until in finds >
> 3. walk backward form that point (the one of >) until it finds <
> 4. return the string between < from 3. and > from 2. as this gives the
> least possible string between < and >
>
> Did I get this right so far? Is this (=least possible string between <
> and >), what non-greedy really translates to?

Nope, that's not how regex works. In particular, it does not do step 3, 
non-greedy regex do not walk backward to find the shortest string that 
matches.

What it does for the non-greedy pattern <.*?> is:

1. walk forward along the text until it finds <
2. if the next character matches . (any chars) then step one character 
forward, otherwise backtrack from where we left off in the previous step
3. if the next char is >, then we find a match, otherwise backtrack from 
where we left off in the previous step
4. return the string between < from 1 and > from 3

or alternatively

1. walk forward along the text until it finds <
2. walk forward from that point until in finds >
3. return the string between < from 1 and > from 2



More information about the Tutor mailing list