How to print all expressions that match a regular expression

Tim Chase python.list at tim.thechases.com
Sun Feb 7 08:47:17 EST 2010


hzhuo1 at gmail.com wrote:
>>> "Given the function hashlib.sha256, enumerate all the possible inputs
>>> that give the hexadecimal result
>>> 0a2591aaf3340ad92faecbc5908e74d04b51ee5d2deee78f089f1607570e2e91."
> 
> This is a hash collision problem. Nobody has proved that SHA-256 is
> collision free

It's actually pretty easy to prove that it is *not* collision 
free.  The SHA-256 encodes 512 bits of data.  So the the process 
of encoding (2**512)+1 distinct inputs incurs a collision in 
SHA-256 space as soon as you've hit (2**512)+1 if not earlier.

to start you off:

   sha_backmap = {}
   for i in xrange((2**512)+2):
     hash = sha(str(i))
     if hash in sha_backmap:
       print "Collision found: %i and %i" % (
         i, sha_backmap[hash])
         break
     sha_backmap[hash] = i

Though it might take a computer the size of the universe, so I'm 
guessing that the first collision encountered is with "42".  I 
leave the actual calculation and hashing of all possible 
combinations of 513 bits of data as an exercise to the reader 
with a lot of time on their hands or a quantum computer under 
their desk ;-)

> It is hard to tell in advance. However, we can add some timing limit
> or counting limit, to make it an algorithm, which can halt. For
> example, whenever the program outputs more than 1000000 expressions
> that match the input regex, we can halt because that exceeds our
> limit. But surely this is not efficient because of the post-decision.

As mentioned, it sounds like you either want a depth-first of the 
solution space that raises exceptions on an infinite/unbounded 
operator ("*", "+", and "{N,}" as mentioned in another email), or 
if you want to handle those operators, do a breadth-first search 
of the solution-space and track your depth (or time taken, or 
previous number of multi-factor atoms if you desire) to ensure 
you don't exceed a certain depth.  But you're still talking a 
combinatorial number of solutions for even simple regexps.

-tkc





More information about the Python-list mailing list