[Compiler-sig] AST observations

Thu, 18 Apr 2002 08:52:00 GMT

[Eric C. Newton]

>I've noticed a few things with the current implementation of ASTs in
>Python 2.2 from my work on PyChecker2.  These are just some notes.
>[...]
>
>   There is this comment:
>
>        "If the visitor method returns a true value, the
>        ASTVisitor will not traverse the child nodes."
>
>   I see no code which checks the return value.

And that is IMO a very good thing that is isn't implemented as
documented. If the visitor/walker hijacked the return value of visitXXXX
methods for its own purposes, the visitor would be completely useless in
both our bytecode compiler and our javacode compiler.

>   Performance:
>
>        For me, the _cache mechanism actually slows down
>        visitation. I found that pre-caching the method names in
>        preorder() _is_ faster.
>
>        Most of my dispatching uses the default dispatcher; the
>        getChildNodes() method, along with compiler.ast.flatten
>        and compiler.ast.flatten_nodes are significant overheads.
>
>        All that said, I re-wrote it using a number of techniques:
>
>           removed the ability to add variable arguments to the walk
>                (no improvement)
>
>           fewer transformations from lists to tuples (minor improv.)
>
>           smarter construction of lists (minor improv.)
>
>           custom getChildNodes() for each class to eliminate
>           calls to flatten() (minor improv.)
>
>           pass a visit() function to a visitChildren() method (slower!)
>
>           write a default recursive dispatcher in C (slower!)
>
>   Convention:
>
>        Setting the "visit" method on the Visitor is, ahem, a novel
>        approach. 

I think you are way too kind here. It sucks; it is plainly a hack and it
is quite hard to map into efficient java.

>        It's a convention that PyChecker (the use of, not
>        the development of) doesn't like ("unknown method visit()").
>        I don't know if I don't like it, but it was unexpected.
>        Passing the walker to the dispatch function is the sort of
>        thing I would expect.
>
>In general, pychecker2 calls "walk(node, Visitor())" a LOT.  The first
>version of pychecker did a lot of things in a single pass.  That is
>pretty efficient, but it's harder to add more checks without creating
>really dense code.  I'm trying to structure pychecker2 around lots of
>independent checks, so it will be easier to contribute new code.  The
>consequence is: I would really like the visitor stuff to run
>efficiently.

I also want the visitor pattern to be superfast because I want to use it
for the on-the-fly javabytecode generation. If it turns out that the
chosen visitor pattern isn't sufficiently efficient I'll be forced to
make our own visitor pattern in parallel with the one in the compiler
package

>I gave up on trying to use recursion to figure out a Node's parents in
>an AST tree.  Very often I need to know the parents of the Node I'm
>looking at, and using recursion to hold this information was becoming
>cumbersome.  For the time being, I'm adding a parent link to all AST
>nodes just after a file is parsed.  An alternative data-structure
>(heap?) would be just as swell so long as I can compute this
>efficiently.

I don't need a parent link for codegeneration (if I did, it would have
been added already <wink>) so from my primary POV, adding a parent link
is pure memory overhead. 

OTOH in my treebuilder I have hooks what would allow me to create a
dictionary of node->parentnode mappings while I'm creating the AST tree.

So how is this for an alternative idea: The main methods (parse() and
parseFile()) grows an optional dict=None argument. When that argument if
not None each created AST node is inserted as key with the parent node
as its value.

>Line numbers appear to be added to AST nodes in arbitrary ways.
>
>Some interesting projects which re-write existing code might like
>other token information, like comments.
>
>People have requested features for pychecker, like detecting
>unnecessary parens and semicolons, which is not possible, since these
>are not part of the AST.

Again it seems like we have been overly focused on codegen. I'm looking
forward to seing Jeremy's thougths on adding parsetree info to the AST.

regards,
finn