unexpected regexp behaviour using 'A|B|C.....'

AlienBaby matt.j.warren at gmail.com
Thu Jul 28 05:56:33 EDT 2011


When using re patterns of the form 'A|B|C|...', the docs seem to
suggest that once any of A, B, C, ... matches, that match is returned
and no further alternatives are tried.  But I am seeing:

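import re

# The header line got wrapped in the mail; rejoined here. The single space
# at the join is an assumption and does not affect the matches below.
st=('  Id Name                    Prov Type  CopyOf              BsId '
    'Rd -Detailed_State-    Adm     Snp      Usr VSize')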

p='Type *'
re.search(p,st).group()
'Type  '

p='Type *|  *Type'
re.search(p,st).group()
' Type'


Shouldn’t the second search return the same as the first, if further
patterns are not tried?

The documentation appears to suggest the first match should be
returned, or am I misunderstanding?

'|'
A|B, where A and B can be arbitrary REs, creates a regular expression
that will match either A or B. An arbitrary number of REs can be
separated by the '|' in this way. This can be used inside groups (see
below) as well. As the target string is scanned, REs separated by '|'
are tried from left to right.

When one pattern completely matches, that branch is accepted. This
means that once A matches, B will not be tested further, even if it
would produce a longer overall match.

In other words, the '|' operator is never greedy. To match a literal
'|', use \|, or enclose it inside a character class, as in [|].
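
For what it's worth, here is a minimal sketch of one way to read the
passage above: the left-to-right rule for '|' seems to apply at a single
starting position, while re.search itself still moves through the string
from left to right and returns the leftmost position where any
alternative matches. The shortened string short_st is a stand-in made up
for illustration, not the original header line.

```python
import re

# At one starting position the alternatives are tried left to right, and the
# first one that matches is accepted even though 'ab' would be longer.
first = re.search('a|ab', 'ab')
print(repr(first.group()))

# Hypothetical shortened stand-in for the header line in the post.
short_st = 'Name  Prov Type  CopyOf'

# 'Type *' alone can only start matching at the 'T'.
m1 = re.search('Type *', short_st)

# With the second alternative added, the scan reaches the space in front of
# 'Type' first. 'Type *' fails there, '  *Type' succeeds, so the search
# stops before it ever gets to the 'T'.
m2 = re.search('Type *|  *Type', short_st)

for m in (m1, m2):
    print(m.start(), repr(m.group()))
```

Comparing m.start() for the two searches shows whether the second result
simply begins one character earlier in the string than the first.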