Enumerating Regular Expressions

Tue May 9 08:07:16 EDT 2006

> Why are people getting stuck on infinite regular
> languages?  I've made it quite clear that I'm only really
>  interested in doing this for finite languages, but that
> shouldn't matter anyway.

The power of regular expressions is that they define a 
consice means to encapsulate an infinite number of inputs. 
If you're not tapping this power, the general wisdom is to 
not use regular expressions.

That said, unless you can ensure you don't have an infinite 
grammer (in particular, no asterisks or plus-signs or 
unbounded ranges like "{4,}"), you then have the problem of 
how to navigate that infinite set.

Taking for example the most common/simple regexp:

	".*"

That matches any and every string.  There's a whole library 
of Congress for which every written work will match that 
string.  It sounds a lot like Jorge Luis Borges's 1956 "The 
Library of Babel".

Do you begin searching depth-first with "a" and then "aa" 
and then "aaa"?  Or do you begin breadth-first with "a", 
then "b"...then "z", then "aa", "ab", ...?  Depth-first is 
far more efficient when searching a branching set of data, 
but when you have infinite depth, it will take a while ;) 
Alternatively, you can use some sort of iterative-deepening 
search which puts an increasing bound on how deep the 
depth-first search will go before giving up for that 
particular iteration.  Googling on "iteritave deepening" 
should bring back some ideas for that one.

If you truely have a finite grammer, it would be feasible to 
write some sort of parser, ugly as it might be.  As a first 
pass, I'd have my parser emit python code to write N nested 
loops to try each combination of each element, so something like

	"[a-zA-Z_][a-zA-Z0-9_]"

became something like

def tries():
	s1 = set(range(ord('a'),ord('z')+1))
	s1.add(range(ord('A'),ord('Z')+1))
	s1.add(ord('_'))
	s2 = set(range(ord('a'),ord('z')+1))
	s2.add(range(ord('A'),ord('Z')+1))
	s2.add(range(ord('0'),ord('9')+1))
	s2.add(ord('_'))
	for a1 in s1:
		for a2 in s2:
			yield "%s%s" % (chr(a1), chr(a2))

Thus, the input regexp would generate python code that you 
could then include in your project.  Ugly, but a possible 
solution.  There are lots of peculiarities though that would 
need to be handled.  Branching, escaped characters in 
character sets (well, escaping in general is a headache).

An alternative might be brute-forcing your regexp:

	for a1 in range(0,256):
		for a2 in range(0,256):
			s = "%s%s" % (chr(a1),chr(a2))
			m = re.match(s)
			if m:
				yield s

nested to however deep you're interested in.

All will be slow.  All are pretty ugly.

-tkc