Confusion about re lookahead assertions

Sat Apr 1 16:49:41 EST 2000

[posted & mailed]

[Skip Montanaro]
> I'm trying to use re's lookahead assertion constructs for the first time
> and am more than a bit confused. If I understand them correctly, the
> regular expression r'(?![.])([0-9]+)\s+' should only match digits if they
> are not preceded by a dot, yet the example below clearly contradicts
> that.

It's your understanding that needs adjustment here.  An "assertion" never
consumes a character.  So, e.g.,

    (?!E1)(E2)

matches at a point p iff E1 does not match at p and E2 does match at p.  In
your case, a digit is not a decimal point, so (?![.])\d+ will match any
string of (one or more) digits.  More generally, if the sets of strings that
E1 and E2 can match are disjoint, sticking (?!E1) in front of E2 does nothing
for you.  Think of it as "all the strings that can match E2, *except for*
the strings that match E1".  In your case, the set of strings composed of
digits minus the set of strings that begin with a decimal point is simply
the set you started with (because they have nothing in common).
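
For contrast, here is a small sketch of a case where E1 and E2 do overlap, so
the assertion actually removes something.  The patterns `ab` (as E1) and
`a[a-z]` (as E2) are made up for illustration; they are not from the original
question.

```python
import re

# Hypothetical E1 = 'ab', E2 = 'a[a-z]'.  The two overlap: 'ab' matches both.
pat = re.compile(r'(?!ab)(a[a-z])')

accepted = pat.match('ac')   # E1 fails at position 0, E2 succeeds there
rejected = pat.match('ab')   # E1 succeeds at position 0, so the match is vetoed

print(accepted.group(1))     # ac
print(accepted.span())       # (0, 2): the assertion itself consumed nothing
print(rejected)              # None
```

Both E1 and E2 are tried at the same starting point; the assertion only
vetoes, it never moves the match position.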

> >>> pat = re.compile(r'(?![.])([0-9]+)\s+')

By the above, this is effectively the same as

    pat = re.compile(r'([0-9]+)\s+')

and everything you saw follows from that.
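
A quick way to convince yourself is to run both patterns over the same inputs
and compare where they match.  This is only a spot check, and the sample
strings are invented for the purpose:

```python
import re

with_assertion = re.compile(r'(?![.])([0-9]+)\s+')
without_assertion = re.compile(r'([0-9]+)\s+')

# Invented sample inputs; any others should behave the same way.
samples = ['12 ', '.34 ', 'a.56 b 78 ', 'no digits']

def spans(pat, text):
    """Return the (start, end) span of every match of pat in text."""
    return [m.span() for m in pat.finditer(text)]

same = all(spans(with_assertion, s) == spans(without_assertion, s)
           for s in samples)
print(same)   # True
```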

What you really want is a negative look*behind* assertion, to say "match a
string of digits, but not if looking backward from the start I see a decimal
point".
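
A sketch of what that looks like, assuming a version of re that supports the
`(?<!...)` negative lookbehind syntax (current Python releases do).  Note that
excluding only the dot is not quite enough: the scan can still start in the
middle of a run of digits, where the preceding character is a digit rather
than a dot.  The second pattern, which also excludes a preceding digit, is my
own tightening and not something from the original question.

```python
import re

text = '12 .34 56 '

# Negative lookbehind: the character just before the digits must not be a dot.
naive = re.compile(r'(?<![.])([0-9]+)\s+')
print(naive.findall(text))    # ['12', '4', '56']  ('4' follows '3', not '.')

# Also refuse to start right after another digit (an assumed refinement).
strict = re.compile(r'(?<![.0-9])([0-9]+)\s+')
print(strict.findall(text))   # ['12', '56']
```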

Presumably you can get the effect you want without this assertion stuff, via
e.g.

    pat = re.compile(r'(?:^|[^.])(\d+)\s+')
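
Here is a rough sketch of how that pattern behaves.  Unlike an assertion,
`[^.]` consumes the character in front of the digits, so the number has to be
read from group 1 rather than from the whole match.  One caveat worth knowing:
a digit is also "not a dot", so on input like '.12 ' the pattern can start
inside the number.  The `tightened` variant below, which adds `\d` to the
excluded class, is an assumption of mine about what is probably wanted.

```python
import re

posted = re.compile(r'(?:^|[^.])(\d+)\s+')

m = posted.search('x 12 ')
print(repr(m.group(0)))    # ' 12 '  (the leading space was consumed)
print(m.group(1))          # 12

# The '1' satisfies [^.], so only the '2' lands in the group.
print(posted.search('.12 ').group(1))    # 2

# Assumed refinement: the preceding character may be neither a dot nor a digit.
tightened = re.compile(r'(?:^|[^.\d])(\d+)\s+')
print(tightened.search('.12 '))              # None
print(tightened.search('x 12 ').group(1))    # 12
```

Because the preceding character is consumed, back-to-back numbers separated by
a single space can interfere with one another when scanning with findall; a
lookbehind does not have that problem.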

> pat = re.compile(r'(?![-A-Za-z0-9:._])([0-9]+)\s+')

In this case, every string that can match E2 also matches E1, so E2-E1 is
empty:  this can't match anything.
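
A short check of that claim, using a few invented inputs.  Every place where
[0-9]+ could start begins with a digit, and a digit is in the character class
the assertion forbids, so the search comes up empty each time:

```python
import re

pat = re.compile(r'(?![-A-Za-z0-9:._])([0-9]+)\s+')

# Invented sample inputs, all of which contain digits followed by whitespace.
samples = ['12 ', '.34 ', 'abc 56 ', '7 8 9 ']

results = [pat.search(s) for s in samples]
print(results)   # [None, None, None, None]
```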

An interface that expressed this stuff as set operations on regular
languages instead would make it all obvious; try to *think* of it that way
anyway <wink>.

easy-to-do-if-you-don't-need-to-ly y'rs  - tim