Why no RE match of A AND B?

Tim Peters tim.one at comcast.net
Mon Mar 3 22:59:22 EST 2003


[John Machin]
> I seem to detect two quite different usages of "intersection":
>
> (1) The regular language (RL) characterised by the RE "abc" produces
> just one string i.e. 'abc'. Ditto the RL described by the RE "xyz"
> produces just one string 'xyz'.

Most people think of the Python regexp "abc" matching any string containing
"abc", though.  You can't get very far discussing regexps in terms of
regular expressions <0.9 wink>.

> I (a mug punter, not an expert) would understand the intersection of the
> two RLs to be empty i.e. [] i.e. the two RLs produce no common strings.

Yes.

> Same with "a+" and "b+", whereas "a*" and "b*" have a non-empty
> intersection [''].

Both so.

> (2) The effect that the OP (together with Tim and other posters)
> seemed to expect from the & operator was that (abc)&(xyz) would match
> (abcxyz)|(xyzabc).

I expected that the OP expected it would be equivalent to the RE

    ^[\x00-\xff]*(abc[\x00-\xff]*xyz|xyz[\x00-\xff]*abc)[\x00-\xff]*\Z

in terms of whether or not it matched.

> <:-)>
> And now for a pragmatic point: if the & operator was ever so slightly
> useful, or ever so slightly demanded, or ever so slightly easy to
> implement, it would have surfaced in a certain language long before
> now.
> </:-)>

RL intersection *is* routinely implemented in finite-state machine packages.
For example, see

    http://www.research.att.com/~mohri/fsm/

and that package's fsmintersect program.  I can think of very few plausible
use cases for it in string-matching, though ("it's a Sunday and a Thursday"
or "it's an identifier and it's a number" make less sense than "it's a floor
wax and a dessert topping"), and in this respect it's relevant that common
notations for grammar specification (like EBNF) don't bother with an
intersection operator either.






More information about the Python-list mailing list