[Python-ideas] thoughts on regular expression improvements

Fri May 6 21:11:59 CEST 2011

I've been doing a lot of RE hacking lately, and some possible
improvements suggest themselves.

1.  Multiple occurrences of a named group

Right now, you can compose RE's with

   x = re.compile("...")
   y = re.compile("..." + x.pattern + "...")

But if x contains named groups, you run into trouble if you have
something like

   z = re.compile("..." + x.pattern + "..." + x.pattern + "...")

which can easily happen if x could occur at various places in z.  The
issue is that a group name may be defined only once per pattern, which
isn't a bad error-prevention mechanism, but it would be nice if a name
could occur more than once (in alternative subexpressions), perhaps
enabled by another RE flag.
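
To make the failure concrete, here is a minimal sketch.  The pattern
and the with_suffix() helper are made up for illustration and are not
part of the re module.  The helper is a naive textual rename: it does
not update (?P=name) backreferences and would be fooled by an escaped
"\(" in front of "?P<".

```python
import re

x = re.compile(r"(?P<num>\d+)")

# Using x twice in one pattern redefines the group name "num",
# which the re module rejects at compile time.
try:
    re.compile(x.pattern + "|id-" + x.pattern)
    dup_error = None
except re.error as exc:
    dup_error = exc


def with_suffix(pattern, suffix):
    """Hypothetical workaround: rename each (?P<name>...) to (?P<name+suffix>...)."""
    return re.sub(
        r"\(\?P<(\w+)>",
        lambda m: "(?P<%s%s>" % (m.group(1), suffix),
        pattern,
    )


# Each copy of x now gets its own group names, so the pattern compiles.
z = re.compile(with_suffix(x.pattern, "_1") + "|id-" + with_suffix(x.pattern, "_2"))
m = z.fullmatch("id-42")
```

The cost of the workaround is that the caller has to look in num_1 and
num_2 separately, which is exactly what allowing the same name in
alternative subexpressions would avoid.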

2.  Easier composition.

Writing

   y = re.compile("..." + x.pattern + "...")

seems a tad grotty, to use a term from my childhood, and affords the RE
engine no purchase on the composition, which can be an issue if the
flags for x are different from the flags for y.

If the first argument to re.compile could be a tuple or list, you could write

   y = re.compile(["...", x, "..."])

and the engine could see that "..." is a string, and that x is an RE, and
could inspect x as necessary.
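
Something close to this can be approximated outside the engine.  The
sketch below assumes a Python version with scoped inline flags, i.e.
the (?i:...) syntax; compose() is a hypothetical helper, not an
existing re function.  It only switches on the flags that x has; it
does not switch off flags that y sets and x lacks, and it does nothing
about clashing group names.

```python
import re

# Flags that can be scoped to a group with the (?flags:...) syntax.
_SCOPED_FLAGS = [(re.I, "i"), (re.M, "m"), (re.S, "s"), (re.X, "x")]


def compose(parts, flags=0):
    """Hypothetical helper: build one pattern from strings and compiled REs.

    Strings are used verbatim.  A compiled RE is wrapped in a
    non-capturing group that carries its own i/m/s/x flags.
    """
    pieces = []
    for part in parts:
        if isinstance(part, str):
            pieces.append(part)
        else:
            letters = "".join(ch for flag, ch in _SCOPED_FLAGS if part.flags & flag)
            pieces.append("(?" + letters + ":" + part.pattern + ")")
    return re.compile("".join(pieces), flags)


x = re.compile("abc", re.IGNORECASE)
y = compose(["^", x, "-xyz$"])
```

Here y matches "ABC-xyz" but not "ABC-XYZ", because the IGNORECASE
flag stays confined to the part that came from x.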

3.  Edit distances.

The RE engine TRE (http://laurikari.net/tre/about/) supports fuzzy
matching of strings, using edit distances.

One can write an expression like "(total){~2}" which would match any
string that's "total" with no more than two edit errors.

You can also specify insertions, deletions, and substitution limits
separately with "+", "-", and "#".
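
For anyone unfamiliar with edit distance, here is a minimal sketch of
the idea in plain Python.  It is not how TRE works internally and it
is far less general: it compares a literal word against a whole
string and counts insertions, deletions, and substitutions together
rather than limiting them separately.  The function names are made up
for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_fullmatch(word, candidate, max_errors):
    """True if candidate is word with at most max_errors edits."""
    return edit_distance(word, candidate) <= max_errors
```

With this, fuzzy_fullmatch("total", "totl", 2) is true (one deletion),
while fuzzy_fullmatch("total", "xyz", 2) is false.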

That would be nice to have...

Bill