How to print all expressions that match a regular expression

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Feb 6 20:28:47 EST 2010


On Sun, 07 Feb 2010 01:51:19 +0100, Alf P. Steinbach wrote:

>> Regular expressions are programs in a "regex" programming language.
>> What you are asking for is the same as saying:
>> 
>> "Is there a program that can enumerate every possible set of data that
>> is usable as valid input for a given program?"
>> 
>> This, in turn, is equivalent to the Halting Problem -- if you can solve
>> one, you can solve the other. You might like to google on the Halting
>> Problem before you spend too much time on this.
> 
> Hm, well, text editors /regularly/ do repeated regular expression
> searches, producing match after match after match, on request.

I think you have completely misunderstood what I'm saying.

I'm not saying that you can't *run* a regular expression against text and 
generate output. That truly would be a stupid thing to say, because I 
clearly can do this:

>>> import re
>>> mo = re.search("p.rr.t", 
... "Some text containing parrots as well as other things")
>>> mo.group()
'parrot'

As you point out, it's not hard to embed a regex interpreter inside a 
text editor or other application, or to call an external library.

What is difficult, and potentially impossible, is to take an arbitrary 
regular expression such as "p.rr.t" (the program in the regex language) 
and generate every possible data ("parrot", "pbrrat", ...) that would 
give a match when applied to that regular expression.

Now, in this case, my example is very simple, and it would be easy to 
enumerate every possible data: there's only 65025 of them, limiting to 
the extended ASCII range excluding NUL (1-255). But for an arbitrary 
regex, it won't be that easy. Often it will be unbounded: the example of 
enumerating every string that matches .* has already been given.

The second problem is, generating the data which gives the output you 
want is potentially very, very, difficult, potentially as difficult as 
finding collisions in cryptographic hash functions:

"Given the function hashlib.sha256, enumerate all the possible inputs 
that give the hexadecimal result 
0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."

This too is unbounded, but you'll have your work cut out just to find 
*one* match, let alone an infinite number of them.

(In this specific example, your best bet is to try a crib: knowing what 
newsgroup this is, and knowing what I've written in the past, the message 
is predictable for being totally unexpected. And yes, that's a hint. A 
shiny penny for the first person to guess what it is.)

I'm suggesting that, in general, there's no way to tell in advance which 
regexes will be easy and which will be hard, and even when they are easy, 
the enumeration will often be infinite. 


-- 
Steven



More information about the Python-list mailing list