AST Manipulation and Common Subexpression Elimination

Sat Oct 18 08:44:10 EDT 2003

Martin v. Löwis wrote:

> Robert Kern wrote:
>
>> 1. Can I do better than brute-force traversing the AST and trying to 
>> match each
>> subtree to the other subtrees (using something like the match() 
>> function given
>> in the parser module example)?
>
>
> Certainly: you could do hashing or sorting. For hashing, define some 
> hash function for expressions, preferably recursively

Coolest way to do this (IMHO) would be the application of something like 
the LZW compression algorithm.

Hmm... Guess I'm still too tired to explain this morning, so I'll just 
have to give an example:

Expression: x = (n1+2)*(n1-1)*(n3+2)*(n1-1)*(n1+2)*(n3+2)*(n1-3)

Step 1

dummy1 = (n1+2)
x = dummy1

First Term isn't known yet, so assign it to dummy1 and stop processing. 
Continue step two at second term.

Step 2

dummy1 = (n1+2)
dummy2 = (n1-1)
x = dummy1*dummy2

Second term isn't known yet, so assign to dummy2 and stop processing. 
Continue step three at third term.

Step 3

dummy1 = (n1+2)
dummy2 = (n1-1)
dummy3 = (n3+2)
x = dummy1*dummy2*dummy3

Third term isn't known yet, so assign to dummy3 and stop processing. 
Continue step four at fourth term.

Step 4

dummy1 = (n1+2)
dummy2 = (n1-1)
dummy3 = (n3+2)
dummy4 = dummy1*dummy2
dummy5 = dummy1*dummy3
dummy6 = dummy2*dummy3
dummy7 = dummy3*dummy4         # Remember that this is dummy1*dummy2*dummy3
x = dummy1*dummy2*dummy3*dummy2*dummy1*dummy3

Fourth term is known already in a "simple" dummy (dummy2). Now continue 
to match "simple dummies" until you find a term that isn't already 
present (n1-3). Repush the non-match onto the factor stack. This makes 
the longest matching sequence dummy2*dummy1*dummy3. Write down all 
possible combinations of at least length 2 of the found terms to dummy 
variables (regardless of order). In these dummies, replace all 
occurences of "complex dummies" with references to earlier "complex 
dummies" (dummy7 = dummy1*dummy2*dummy3, when you start at the 
beginning, this makes dummy3*dummy4). Continue step five at seventh term.

Step 5

dummy1 = (n1+2)
dummy2 = (n1-1)
dummy3 = (n3+2)
dummy4 = dummy1*dummy2
dummy5 = dummy1*dummy3
dummy6 = dummy2*dummy3
dummy7 = dummy3*dummy4
dummy8 = (n1-3)
x = dummy1*dummy2*dummy3*dummy2*dummy1*dummy3*dummy8

Seventh term isn't know yet, so assign to dummy8 and stop processing. As 
this is the last term, stop compression and continue with cleanup.

Step 6

dummy1 = (n1+2)
dummy2 = (n1-1)
dummy3 = (n3+2)
dummy4 = dummy1*dummy2
dummy5 = dummy1*dummy3
dummy6 = dummy2*dummy3
dummy7 = dummy3*dummy4        # This is dummy1*dummy2*dummy3
dummy8 = (n1-3)
x = dummy7*dummy7*dummy8

Apply same algorithm as with step 4, replacing all occurences of 
"complex dummies" found in x with references to other "complex dummies", 
starting to match at the beginning.

Step 8

Finally Remove all terms that aren't directly or indirectly referenced 
from x. (Remove dummy5 and dummy6).

This algorithm should create longest matching subsequences. If I 
understand correctly, you want to deal with more than just 
multiplication, but I guess you could modify this algorithm so that it 
works on sums and multiplication, as you need it to (it would have to 
know about bracket rules, which it doesn't at the moment).

If the above description isn't algorithmically clear, feel free to mail 
me, and I might (if I find the time) code a reference implementation.

HTH!

Heiko.