regular expression problem

MRAB python at mrabarnett.plus.com
Sun Oct 28 16:04:37 EDT 2018


On 2018-10-28 18:51, Karsten Hilbert wrote:
> Dear list members,
> 
> I cannot figure out why my regular expression does not work as I expect it to:
> 
> #---------------------------
> #!/usr/bin/python
> 
> from __future__ import print_function
> import re as regex
> 
> rx_works = '\$<[^<:]+?::.*?::\d*?>\$|\$<[^<:]+?::.*?::\d+-\d+>\$'
> # it fails if switched around:
> rx_fails = '\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$'
> line = 'junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk'
> 
> print ('')
> print ('line:', line)
> print ('expected: $<match_A::options A::4>$')
> print ('expected: $<match_B::options B::4-5>$')
> 
> print ('')
> placeholders_in_line = regex.findall(rx_works, line, regex.IGNORECASE)
> print('found (works):')
> for ph in placeholders_in_line:
> 	print (ph)
> 
> print ('')
> placeholders_in_line = regex.findall(rx_fails, line, regex.IGNORECASE)
> print('found (fails):')
> for ph in placeholders_in_line:
> 	print (ph)
> 
> #---------------------------
> 
> I am sure I simply don't see the problem ?
> 
Here are some of the steps the regex engine goes through while matching
the second regex (rx_fails). (View this in a monospaced font.)
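
For reference, the symptom being traced can be reproduced with a few lines. This is only a minimal sketch: it reuses the two patterns and the test line from the quoted script, but writes the patterns as raw strings and drops the IGNORECASE flag, neither of which changes the result here. The names found_works and found_fails are introduced just for this illustration.

```python
import re

line = 'junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk'

# Same patterns as in the quoted script, written as raw strings.
rx_works = r'\$<[^<:]+?::.*?::\d*?>\$|\$<[^<:]+?::.*?::\d+-\d+>\$'
rx_fails = r'\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$'

found_works = re.findall(rx_works, line)
found_fails = re.findall(rx_fails, line)

# Two separate placeholders:
# ['$<match_A::options A::4>$', '$<match_B::options B::4-5>$']
print(found_works)

# One match that runs from the first placeholder into the second:
# ['$<match_A::options A::4>$  junk  $<match_B::options B::4-5>$']
print(found_fails)
```

The only difference between the two patterns is the order of the alternatives, which is why the trace concentrates on what the first alternative of rx_fails does.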


1:
junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
       ^

\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
^


2:
junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
                  ^

\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
             ^


3:
The .*? matches as few characters as possible, initially none.

junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
                           ^
\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
                ^


4:
junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
                              ^

\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
                     ^

At this point it can't match (the regex needs "-" but the text has ">"),
so it backtracks.


5:
The .*? matches more characters, including the "::" it stopped at before.

After more matching, the state is as follows.

junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
                                                 ^

\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
                ^


6:
junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
                                                   ^

\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
                  ^

Again it can't match (the regex needs a digit but the text has "o"), so
it backtracks.


7:
The .*? matches more characters, again including the "::" it stopped at.

After more matching, the state is as follows.

junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
                                                            ^

\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
                ^

8:
junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk
                                                                   ^

\$<[^<:]+?::.*?::\d+-\d+>\$|\$<[^<:]+?::.*?::\d*?>\$
                            ^

Success!

The first alternative has matched all of this, from the start of the
first placeholder to the end of the second:

$<match_A::options A::4>$  junk  $<match_B::options B::4-5>$
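
One possible way to avoid this is sketched below. It is not from the original script: rx_fixed is a hypothetical pattern, and it assumes that the options part of a placeholder never contains ">$". The idea is to stop the lazy "any character" part from stepping over the end of a placeholder, and to choose between "N-M" and "N" only after the shared prefix has matched.

```python
import re

line = 'junk  $<match_A::options A::4>$  junk  $<match_B::options B::4-5>$  junk'

# Hypothetical replacement pattern (an assumption, not from the thread):
# - (?:(?!>\$).)*? is a lazy "any character" that refuses to step onto ">$",
#   so it cannot run past the end of a placeholder;
# - (?:\d+-\d+|\d*) tries "N-M" first and falls back to "N" (or nothing)
#   at the same position if that fails.
rx_fixed = r'\$<[^<:]+?::(?:(?!>\$).)*?::(?:\d+-\d+|\d*)>\$'

placeholders = re.findall(rx_fixed, line)
for ph in placeholders:
    print(ph)
# $<match_A::options A::4>$
# $<match_B::options B::4-5>$
```

With this pattern a malformed placeholder such as "$<a::b>$" is skipped instead of being merged with the one after it. If the options text can legitimately contain ">$", the lookahead would need a different stop condition.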


