[Python-3000] Refactoring tool available (work in progress)

Guido van Rossum guido at python.org
Fri Dec 15 06:51:25 CET 2006


In the sandbox I've been working on a refactoring tool, which could
form the basis for a Python 2.x -> 3.0 conversion tool.  I'd like to
invite folks here to give it a try and give me a hand. It certainly
needs more work, but I think that the basic infrastructure is sound.
Check out sandbox/2to3/:
http://svn.python.org/view/sandbox/trunk/2to3/.

This message is to invite feedback, and to encourage contributions. It
would be great if people tried their hands at writing new
transformations!

A brief description of how it works:

We start with a slightly modified version of tokenize.py. The tokens
are passed to a parsing engine (the Python re-implementation of pgen
that I wrote at Elemental), also slightly modified, which builds up a
concrete parse tree. Through some simple tricks (this is what I had to
modify tokenize.py for), the parse tree contains annotations that hold
on to all comments and whitespace between tokens. This allows us to
reconstruct the original source *exactly*. (I tested this on all .py
files in the Python distribution and every single file got
reconstructed exactly.)
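
To make that concrete, here is a minimal, self-contained sketch of
the round-trip idea using the standard tokenize module (this just
illustrates the technique; the sandbox uses its own modified
tokenize.py and stores the annotations in the tree):

  import io
  import tokenize

  def annotate(source):
      # Pair each token with the exact inter-token text (whitespace,
      # comments, backslash continuations) that precedes it.
      lines = source.splitlines(True) + [""]  # sentinel for ENDMARKER
      def between(start, end):
          (r1, c1), (r2, c2) = start, end
          if r1 == r2:
              return lines[r1 - 1][c1:c2]
          text = lines[r1 - 1][c1:]
          for r in range(r1, r2 - 1):
              text += lines[r]
          return text + lines[r2 - 1][:c2]
      prev_end = (1, 0)
      readline = io.StringIO(source).readline
      for tok_type, tok_str, start, end, line in \
              tokenize.generate_tokens(readline):
          yield between(prev_end, start), tok_str
          prev_end = end

  def reconstruct(source):
      return "".join(prefix + text for prefix, text in annotate(source))

  src = "x = {  }  # comment\nif x:\n    pass\n"
  assert reconstruct(src) == src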

Once we've built a parse tree for a file we can start refactoring.
This is done by traversing the parse tree looking for nodes that match
a given pattern; for matching nodes, we invoke a transformation method
that returns a replacement node (or None if further inspection shows
the transformation can't actually be applied despite the pattern
matching); this replacement node is then grafted into the tree in
place of the originally matching node.
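
In code, the traversal might look roughly like this toy version
(Node and the fixer protocol here are simplified stand-ins, not the
sandbox's actual classes):

  class Node:
      def __init__(self, type, children=()):
          self.type = type
          self.children = list(children)
          self.parent = None
          for child in self.children:
              child.parent = self

      def pre_order(self):
          yield self
          for child in self.children:
              for node in child.pre_order():
                  yield node

      def replace(self, new):
          # graft 'new' into the tree in place of self
          siblings = self.parent.children
          siblings[siblings.index(self)] = new
          new.parent = self.parent

  def refactor(tree, fixers):
      # snapshot the traversal so grafting doesn't upset iteration
      for node in list(tree.pre_order()):
          for fixer in fixers:
              results = fixer.match(node)    # submatches, or None
              if results is not None:
                  new = fixer.transform(node, results)
                  if new is not None:        # fixer may still decline
                      node.replace(new)
                      break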

Originally I created matching patterns by constructing a little tree
of "pattern nodes". This quickly got tedious, and I designed a simple
language for creating patterns (inspired by the language used to
describe Python's grammar). The transformation routine is still just
Python code; I would like to be able to use a more compact notation
for this too, but it's more complicated since (in practice) there is a
lot of ad-hoc testing that goes on. Well, have a look at the three
example transformations in the code (in the "fixes" subdirectory).
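
To give a flavor, here is a skeleton of what such a fixer might look
like (the class name, attribute names and method signature here are
illustrative; see the actual files in fixes/ for the real
conventions):

  class FixApply:
      # The pattern, written in the little pattern language.
      PATTERN = """
      power< 'apply' trailer< '(' arglist< any ',' any [',' any] > ')' > >
      """

      def transform(self, node, results):
          # 'results' holds any named submatches (more on those
          # below); return a replacement node, or None to decline
          # the transformation after closer inspection.
          ...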

There's a driver program (refactor.py) that lets you repeat this
process for any number of files or for all Python files in a given
directory tree, with command line switches to select which
transformations to apply and whether to actually write back the
transformed files or just show the diffs that *would* be applied (the
default).
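
An invocation looks something like this (check the driver's help
output for the exact switches):

  python refactor.py -f apply Lib/       # show diffs only (default)
  python refactor.py -w -f apply Lib/    # write the changes back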

Let me describe how the "apply" transformation works, from a high
level. This should change

  apply(f, x, y)

into

  f(*x, **y)

(and similarly, apply(f, x) into f(*x) when y is missing).

The pattern looks like this, to a first approximation:

  power< 'apply' trailer< '(' arglist< any ',' any [',' any] > ')' > >

This references the following rules from the Python grammar (I'm only
showing the parts that matter for this example):

  power: atom trailer* ['**' factor]
  atom: NAME | ...
  trailer: '.' NAME | '[' subscriptlist ']' | '(' [arglist] ')'
  arglist: argument (',' argument)*

In the pattern language, "power<...>" means "match a node of type
'power' whose subnodes match ...". In this example, the ... is 'apply'
trailer<...>; then those ... are filled in with '(' arglist<...> ')'
and finally those ... are: any ',' any [',' any]. Here 'any' is a
special token that matches every possible node, and [...] indicates an
optional part. Stuff in quotes of course means literal tokens; we use
this to match specific keywords, identifiers or operators.
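
Spelled out with the nesting made explicit, the same pattern reads:

  power<
      'apply'
      trailer<
          '('
          arglist< any ',' any [',' any] >
          ')'
      >
  >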

Patterns match only the formal grammar; they completely ignore the
whitespace and comments that are included in the parse tree as
annotations.

The pattern language has an additional useful feature: you can
associate names with subpatterns, and the pattern matching process (if
successful) will return a dictionary that maps those names to the tree
nodes that matched those subpatterns. The syntax for this is currently
to prefix the subpattern with "NAME="; I'm not sure if this is the
most readable syntax but it's the best I could come up with.
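
For example, naming the three argument slots in the apply pattern
could look like this:

  power< 'apply'
         trailer< '(' arglist< func=any ',' args=any [',' kwds=any] >
                  ')' > >

A successful match then hands the transformation a dict along the
lines of {'func': <node>, 'args': <node>, 'kwds': <node>}, with
'kwds' simply absent when the optional third argument isn't there.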

The final pattern language feature is negative look-ahead matches;
something like (not xyzzy) forces the match to fail if the pattern
xyzzy matches at this point. (I wanted the syntax to be !xyzzy but
tokenize.py doesn't like that because ! isn't a valid Python
delimiter.)
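
For instance, the apply pattern can use this to refuse keyword
arguments (which the simple rewrite would mishandle) by guarding an
argument slot with something like

  (not argument< NAME '=' any >) func=any

using the 'argument' rule from the grammar.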

So now we have identified a node that matches the pattern. The
transformation extracts the subnodes corresponding to f, x and y (if
present), and then constructs a new node of the form f(*x, **y).
There's one complication worth mentioning: when the original code is
apply(f+y, a, b), the transformation must produce (f+y)(*a, **b); we
add the extra parentheses only when they are really needed, though.
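
As a toy illustration of the target output, here is the rewrite at
the string level (the real transformation of course builds parse
tree nodes, not strings):

  def convert_apply(func, args, kwds=None, func_is_atom=True):
      # Build the textual form of the replacement call; the real
      # fixer decides atom-ness by inspecting the node type.
      if not func_is_atom:               # e.g. func is "f+y"
          func = "(%s)" % func
      if kwds is None:
          return "%s(*%s)" % (func, args)
      return "%s(*%s, **%s)" % (func, args, kwds)

  assert convert_apply("f", "x", "y") == "f(*x, **y)"
  assert convert_apply("f+y", "a", "b",
                       func_is_atom=False) == "(f+y)(*a, **b)"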

There are some unsolved problems. First of all, we don't have a symbol
table. If apply is redefined in a particular scope, we don't notice
this and still apply the transformation. And of course apply is a
relatively simple case (being a free variable); has_key is more
complicated, e.g. imagine x.y.z.has_key(a).
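
For instance,

  x.y.z.has_key(a)

has to become

  a in x.y.z

so the pattern must match an arbitrary expression followed by a
.has_key trailer, not just a fixed name.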

Second, the transformations can sometimes lose comments. For example,
if the input looks something like

  (x.has_key  #blah
     (y))

the #blah comment is lost, since there is no good place to put it in
the translation "(y in x)". Part of this problem is solvable by making
the transformations preserve comments more carefully; but sometimes
there just isn't a good place to put them. Maybe this is rare enough
in practice that we could just leave such cases untranslated, instead
flagging them as code that requires manual conversion.

I also expect to run into problems with transformations for things
like .keys(); right now the best solution I can think of is to
translate xxx.keys() into list(xxx.keys()) and translate
xxx.iterkeys() into iter(xxx.keys()). That's ugly, but I can't think
of anything better without doing a ton of whole-program analysis,
which I'd like to avoid. The .keys() transformation is also likely to
make the tool non-idempotent; that's perhaps unavoidable, but still
unfortunate, since it means you have to keep track of which files
have already received which transformations.
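
To see the idempotency problem, imagine running the .keys()
transformation twice over the same file:

  d.keys()              # original 2.x code
  list(d.keys())        # after the first run
  list(list(d.keys()))  # after a second, unwanted run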

Finally, I have a dream: a GUI that will let you do this
interactively, sort of like query-replace in Emacs. But this message
is already too long, so I'll stop for now. Thanks for reading this
far. :-)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

