[Types-sig] A case study

Tim Peters tim_one@email.msn.com
Wed, 15 Dec 1999 05:08:54 -0500


[Guido]
> ...
> What do we now know about the type of the list variable?  It was
> initialized to an empty List.  It's still a List, and we know
> that at least some of its items are lists.  Are all its items
> lists?  This gets us into similar issues as the recursive call
> to find() before, and just as there, I'm not sure that we really
> do, so maybe we need to continue the single type hypothesis.  (One
> way out would be to assume the single type hypothesis until we
> see positive proof to the contrary, and if so, redo the analysis
> with a less restricted type.)

These are all std problems in dataflow analysis.  Conceptually:  you have a
program graph (rooted and directed), where nodes are basic blocks (single
entry, single exit), and arcs represent control flow.  Associate (still
conceptually) with every node a table mapping every name to the set of types
it may have upon entry to the block, and another table doing likewise for
block exit.  Initialize all these to empty sets (that is, replace your
"single type" hypothesis with the "no type" hypothesis!).

Traverse the graph.  Each block has certain effects on its exit type
mappings.  These need to propogate to the block's successors.  At each block
entry, the set of types a name maps to is just the union of the set of types
the name maps to at the exits of all predecessor blocks.  The root of a
function's graph is a slightly special case, in that the arglist acts like a
predecessor block for this purpose.

You continue propagating changes until you reach a steady state.  Meaning
that, for each node, the entry map equals the union of the predecessors'
exit maps, and the exit map is consistent with the entry map as modified by
the bindings in the block.

The hard parts are changing this from intuitive conception to efficient
implementation (global dataflow analysis can consume enormous amounts of
memory -- all blocks * all names * all functions * all modules == a whole
lot), and in crafting the type system so that you know a priori that you
*must* reach a steady state (e.g., it's probably not a good idea to say that
lists whose length is a prime number constitute "a type" <0.31 wink>).

Freebie:  if, at the end of this, there exists a block and a local name such
that the first occurrence of the name within the block is a reference, and
the name is still associated with the empty set in the block's entry map,
you've got an UnboundLocalError waiting to happen (provided the block is
reachable).

Semi-freebie:  If the first occurrence of a local name within a block is a
reference, and at least one of the block's predecessors associates this name
with the empty set in its exit map, you've got something very close to a
violation of Java's "definite assignment" rules.  That is, there is *a* path
in which this name may not be bound before reference; although you cannot,
in general, prove that it's *possible* for that path to occur at runtime.
Java gives a fatal error anyway, and after the first hour I came to like
that.

So your intuition is on the right track here.  What I can add as a former
Professional Compiler Writer is my Professional Assurance that making this
all run efficiently (in either time or space) is a Professional Pain in the
Professional Ass.  Because of this, global analysis never works out in
practice unless you invent an efficient database format to cache the results
of analysis, keeping that in synch with the source base under mutation.
It's all too easy to come up with a toy system that absolutely will not
scale to real life!  Python has an advantage, though, in that most people
write very small functions and methods most of the time.  If you can, in
addition, avoiding needing to deduce the types of most globals, it could
actually fly before we're all dead <wink>.

but-civilization-ends-in-a-few-weeks-anyway-ly y'rs  - tim