[Types-sig] A case study

Tue, 14 Dec 1999 10:19:52 -0500

Here's a long and rambling example of what I think a type inferencer
could do -- without type declarations of any sort.  I wrote this down
while thinking about the type checker that I would like to see in
IDLE.

--Guido van Rossum (home page: http://www.python.org/~guido/)

"""Let's analyze a simple program.  Here's an example script -- let's
call it pyfind.py -- that prints the names of all Python files in a
given directory tree."""

#----------------------------------------------------------------------
import sys, find

def main():
    dir = "."
    if sys.argv[1:]:
        dir = sys.argv[1]
    list = find.find("*.py", dir)
    list.sort()
    for name in list:
        print name

if __name__ == "__main__":
    main()
#----------------------------------------------------------------------

"""Our task is to check whether this is a correct program.  I won't
define correctness rigidly here, but it has something to do with under
what circumstances the program will execute to completion.

At the top level, we see an import statement, a function definition,
and an if statement.

Analyzing the import statement, we notice that sys is a well-known
standard module.  The find module is in the standard library.  (In
Python 1.6 it will be obsolete, but it's a convenient example.)

Let's have a look at find.py just to see if there's any weirdness
there:"""

#----------------------------------------------------------------------
import fnmatch
import os

_debug = 0

_prune = ['(*)']

def find(pattern, dir = os.curdir):
    list = []
    names = os.listdir(dir)
    names.sort()
    for name in names:
        if name in (os.curdir, os.pardir):
            continue
        fullname = os.path.join(dir, name)
        if fnmatch.fnmatch(name, pattern):
            list.append(fullname)
        if os.path.isdir(fullname) and not os.path.islink(fullname):
            for p in _prune:
                if fnmatch.fnmatch(name, p):
                    if _debug: print "skip", `fullname`
                    break
            else:
                if _debug: print "descend into", `fullname`
                list = list + find(pattern, fullname)
    return list
#----------------------------------------------------------------------

"""This imports two more modules, and then defines two variables and a
function.  Module os is a standard library module with special status.
It's written in Python, but its source code is actually pretty hairy
and dynamic; we can assume that its effective behavior can be
hardcoded in the analyzer somehow, or perhaps we show the analyzer an
idealized version of its source code.  (This is an example of one
trick our analyzer can use to make its life easier.  It's equivalent
to the concept of a "lint library" for the Unix/C lint tool.)

Let's look at the fnmatch source code (I've left out some doc
strings):"""

#----------------------------------------------------------------------
import re

_cache = {}

def fnmatch(name, pat):
	import os
	name = os.path.normcase(name)
	pat = os.path.normcase(pat)
	return fnmatchcase(name, pat)

def fnmatchcase(name, pat):
	if not _cache.has_key(pat):
		res = translate(pat)
		_cache[pat] = re.compile(res)
	return _cache[pat].match(name) is not None

def translate(pat):
	i, n = 0, len(pat)
	res = ''
	while i < n:
		c = pat[i]
		i = i+1
		if c == '*':
			res = res + '.*'
		elif c == '?':
			res = res + '.'
		elif c == '[':
			j = i
			if j < n and pat[j] == '!':
				j = j+1
			if j < n and pat[j] == ']':
				j = j+1
			while j < n and pat[j] != ']':
				j = j+1
			if j >= n:
				res = res + '\\['
			else:
				stuff = pat[i:j]
				i = j+1
				if stuff[0] == '!':
					stuff = '[^' + stuff[1:] + ']'
				elif stuff == '^'*len(stuff):
					stuff = '\\^'
				else:
					while stuff[0] == '^':
						stuff = stuff[1:] + stuff[0]
					stuff = '[' + stuff + ']'
				res = res + stuff
		else:
			res = res + re.escape(c)
	return res + "$"
#----------------------------------------------------------------------

"""This in turn imports the re module.  This one is a bit too long to
include here; let's assume that, like the os module, it's known as a
special case to the analyzer.  Just a variable initialization and
three function definitions here, no other executable code.

Let's go back to the top-level script.  I think the analyzer can
easily recognize the idiom ``if __name__ == "__main__": ...'': it can
know that since this is the root of the program, __name__ is indeed
equal to "__main__", so it knows that main() gets called.

Now we need to analyze main() further.  Here it is again, with line
numbers:"""

#----------------------------------------------------------------------
def main():                             # 1
    dir = "."                           # 2
    if sys.argv[1:]:                    # 3
        dir = sys.argv[1]               # 4
    list = find.find("*.py", dir)       # 5
    list.sort()                         # 6
    for name in list:                   # 7
        print name                      # 8
#----------------------------------------------------------------------

"""In line 1, we see that there are no arguments.  Line 2 initializes
the variable dir with the constant value ".", so we know its type is a
string at this point.

There's one other assignment to dir, on line 4.  How do we know that
this is also assigning a string to dir?  My reasoning as the human
reader of the program is that sys.argv is initially a list of strings,
so sys.argv[x] for any x either raises an exception or yields a string
value.  The initial type of sys.argv can be known to the analyzer.

How does the analyzer know that no other code has assigned anything
different to sys.argv?  Exhaustive analysis can probably show that
there are no assignments to sys.argv or its items (or slices) anywhere
in all the modules used by the program, nor are there calls to any of
the list-modifying methods.  We may be able to restrict ourselves to
the code that may already have run by the time we reach this statement
-- but then we have to prove that there are no other calls to main().

Maybe the analyzer can be primed with special knowledge about
sys.argv, e.g. that its type cannot change.  Then statements that
cannot be proved to keep its type the same can be flagged as errors.
Of course this gets muddy in the light of aliasing -- we'd need to
keep around the information that some variable might point to
sys.argv.  Fortunately that kind of information seems useful in
general.

OK, so we know that dir is a string.  Let's go back to line 3: the
checker notes that sys is an imported module (the sys module) which
has indeed an argv argument that is sliceable.  It will also note that
the expression ``1'' has the type integer which is a valid slice
index.  (There are no out-of-bounds exceptions for slice indices.)

In line 4, we need to have another look at the expression
``sys.argv[1]''.  (Note: I'm not saying that the analyzer jumps around
haphazardly like this, I'm just making a case for what kinds of
processes typically go on in the analyzer.  In reality it probably
goes at it in a much more orderly fashion.)  Again, sys.argv is
recognized as a list, so it can be indexed, and the expression ``1''
has the correct type.  Now, indexing may cause an IndexError
exception.  Can the checker prove that we won't (ever!) get an
IndexError at this particular line because of the test in the if
statement on the previous line?  I think that may be asking a bit
much.  But it knows that if the index is valid, the result will be a
string (see above).

Next, line 5.  Here the analyzer knows that find is a module we
imported, and that find.find is a function defined in that module.  We
call it with two arguments.  The first is a string literal; the second
is our local variable dir, which is also a string.  Let's have a look
at the function definition again:"""

#----------------------------------------------------------------------
def find(pattern, dir = os.curdir):                                    #  1
    list = []                                                          #  2
    names = os.listdir(dir)                                            #  3
    names.sort()                                                       #  4
    for name in names:                                                 #  5
        if name in (os.curdir, os.pardir):                             #  6
            continue                                                   #  7
        fullname = os.path.join(dir, name)                             #  8
        if fnmatch.fnmatch(name, pattern):                             #  9
            list.append(fullname)                                      # 10
        if os.path.isdir(fullname) and not os.path.islink(fullname):   # 11
            for p in _prune:                                           # 12
                if fnmatch.fnmatch(name, p):                           # 13
                    if _debug: print "skip", `fullname`                # 14
                    break                                              # 15
            else:                                                      # 16
                if _debug: print "descend into", `fullname`            # 17
                list = list + find(pattern, fullname)                  # 18
    return list                                                        # 19
#----------------------------------------------------------------------

"""Indeed the function takes two arguments.  It's also reassuring that
the second argument has a default argument of type string (os.curdir
is easily recognized as a string, using similar reasoning as for
sys.argv).

Now let's analyze it further.

Line 2 defines a local variable list initialized to an empty list.
Will it always have the type List, throughout this function?  There's
an assignment further down to this variable from the expression ``list
+ find(...)''.  How can we prove that the type of that variable is
List?  I can show it in any of two ways, neither is very satisfactory:

    1. List objects support a + operator only with a List right
    operand, and the result is a List.  The problem with this is that
    the right operand might be a class instance that defines __radd__
    and returns something else, so it's not a valid proof.

    2. Using induction: if recursion level N returns a List, the
    assignment ``list = list + find(...)'' shows that recursion level
    N+1 has type List; recursion level 0 (no recursive calls) has
    return type List; so all recursion levels have type List.  This is
    a valid proof (though I should write it down more carefully) but
    I'm not sure if I can assume that the analyzer is smart enough to
    deduce it!

I'm not sure how to rescue myself out of this conundrum; perhaps
there's value in John Aycock's assumption that variables typically
don't change their type unless shown otherwise; then we could assume
list was a List throughout.  Still thin ice, but this is a common
pattern.

The rest is a bit simpler.  Line 3 calls a known system function
taking a string and returning a list of strings, or raising os.error.
We remember that the dir argument is a string so this is valid.  We
also note that this might raise os.error.  As indeed it will when we
pass it a non-directory as a command line argument.  The return type
shows us that the local variable names is set to a list of strings.
Line 3 sorts that list.  The analyzer should know that this
calls the built-in function cmp() pairwise on items of the list;
comparing strings is fine so there's no chance of an error here.

In line 5 we iterate over names, which is a list of strings, so we
know name is a string.

Plodding along: ``name in (os.curdir, os.pardir)'' is a valid test;
the rhs of the in operator is a tuple of strings and we know that the
in operator calls cmp(name, x) for each x in the tuple; again, this is
fine.

Line 8, ``fullname = os.path.join(dir, name)'': we can know that
os.path.join is a function of 1 or more string arguments returning a
string; the arguments are both strings and we now know that the local
variable fullname is a string, too.

Line 9 calls fnmatch.fnmatch().  I postulate that it's obvious that
this takes two arguments and returns a Boolean:"""

#----------------------------------------------------------------------
def fnmatch(name, pat):
	import os
	name = os.path.normcase(name)
	pat = os.path.normcase(pat)
	return fnmatchcase(name, pat)

def fnmatchcase(name, pat):
	if not _cache.has_key(pat):
		res = translate(pat)
		_cache[pat] = re.compile(res)
	return _cache[pat].match(name) is not None
#----------------------------------------------------------------------

"""I leave it as an exercise that the argument types (strings again)
are correct and that no other errors can occur.

Line 10 calls list.append(fullname).  We know that list is a List
object and that fullname is a string.  We should also know the effect
of a list's append method; the call is correct (it takes one argument
of arbitrary type).

What do we now know about the type of the list variable?  It was
initialized to an empty List.  It's still a List, and we know that at
least some of its items are lists.  Are all its items lists?  This
gets us into similar issues as the recursive call to find() before,
and just as there, I'm not sure that we really do, so maybe we need to
continue the single type hypothesis.  (One way out would be to assume
the single type hypothesis until we see positive proof to the
contrary, and if so, redo the analysis with a less restricted type.)

I'll leave the rest of this function as an exercise; no new principles
are employed.  Note that _prune is a global variable initialized to a
list of strings, and, with John, we'll assume that that is its final
type; this makes everything work.

This function is recursive.  Could we prove that the recursion will
terminate?  Probably not; it would require knowing filesystem
properties.  Note that if it weren't for the os.path.islink() test, it
would be possible to create a structure in the filesystem that would
cause infinite recursion here!  So our analyzer might flag this
function as questionable recursive.

We can now finish the analysis of our original main function:"""

#----------------------------------------------------------------------
def main():                             # 1
    dir = "."                           # 2
    if sys.argv[1:]:                    # 3
        dir = sys.argv[1]               # 4
    list = find.find("*.py", dir)       # 5
    list.sort()                         # 6
    for name in list:                   # 7
        print name                      # 8
#----------------------------------------------------------------------

"""We know that find.find() returns a list of strings, so this is the
type of the list variable.  We already talked about sorting a list of
strings.  I can probably prove that sorting here is redundant, given
the way find() sorts its list of names, but that will be hard for the
analyzer, so I doubt that it will find this subtle optimization tip.

The final for loop and print statement have no further problems; we
know that list is a List, which is a sequence, so a for loop can
iterate over it; the print statement calls str() on each of its
arguments and this function is always safe on strings (as it does for
most types, except instances or extension types that raise exception
in their __str__ implementation)."""