Atoms, Identifiers, and Primaries

Wed Apr 17 14:43:43 EDT 2013

Wow, that's some impressive wall of text! Splitting your comments up into a
few paragraphs would make it much easier to read :-)

My comments below...

On Wed, 17 Apr 2013 10:15:02 -0700, Bruce McGoveran wrote:

> Thank you all for your thoughtful replies.  I appreciate your collective
> insight.  I didn't mean to cast the concept of recursion in a negative
> light - I'm actually comfortable with the concept, at least to some
> extent, and I appreciate the need for its use in this documentation.  I
> also appreciate the need to play with expressions at the command line,
> to gain a feel for how expressions are evaluated.  My interest in the
> language's formal description arises merely out of a desire to
> understand as precisely as possible what happens when I hit enter at the
> command line, or when I run a module. 

You won't gain that from the *grammar* of the language. Grammar is only part
of the story, and in some ways, the least important part. If I tell you
that the grammar of English includes:

ADJECTIVE NOUN

that alone is not going to help you understand the differences between
a "wise man" and a "wise guy", or "peanut oil" and "baby oil".

I'm not saying that syntax and grammar is unimportant, but it is independent
of the *semantics* of the program, and really its the semantics (the
meaning) of code that is important. One can easily imagine four languages
where the identical operation was written as:

name.attribute
name->attribute
name(attribute)
attribute of name

Contrariwise, just because two languages have ostensibly the same syntax for
something, doesn't mean they are doing the same thing. E.g. there are
subtle differences between attribute lookup name.attribute in Java and
Python.

> Your answers to my initial
> questions in this thread and the ones I posed in another thread
> ("Understanding Boolean Expressions") have lead me to some follow-up
> questions.  Suppose I'm working at the command line, and I bind x to the
> value 1 and y to the value 0.  Suppose I next type x and y and hit
> enter.  Python returns 0 (zero).  I'm glad I checked this before sending
> in this post because I thought it would return a value of False based on
> the presence of the and operand.

The command line is actually irrelevant here. With one or two minor
exceptions, Python will do the same thing whether you are working
interactively or not. The main differences are that at the interactive
prompt, Python will automatically print the result of any expression which
is not otherwise bound to a value, and also bind it to the variable
name "_". (A single underscore.) Other interactive command prompts may do
more, or less.

> My question:  what did the interpreter
> have to do to evaluate the expression x and y and return a value of
> zero? I know the lexical analyzer has to parse the stream of characters
> into tokens.  I presume this parsing generates the toxens x, y, and, and
> a NEWLINE.  

Well, yes, but you're being awfully reductionist here. I'm the first to be
in favour of curiosity for curiosity's sake, but I'm not sure that getting
bogged down at such a low level this early in your Python learning
experience is a good idea. *shrug* No skin off my nose though.

The answer is going to depend on the implementation. There are at least four
major implementations of Python these days, and another dozen or two
obsolete, experimental or minor implementations. (Oh, and let me apologise
in advance to anyone whose implementation I haven't listed as "major".)

CPython is the implementation you are probably using; Jython runs on top of
the Java virtual machine, IronPython runs on top of the Dot Net virtual
machine, and PyPy runs on the deepest, darkest voodoo known to Mankind. But
essentially, any implementation will have to perform all or most of these
steps:

* Parse the source code into tokens. CPython generates an AST, Abstract
Syntax Tree. What that means in practice, I have no idea. This is
relatively new: some versions back, the syntax was essentially identical,
but there was no AST involved.

* From the tokens, or the AST, generate byte code. Or machine code, if your
compiler is clever enough.

* Execute the byte code in some virtual machine, or the machine code
directly on your CPU, as the case may be.

You can view the byte code using the dis module, e.g.:

py> import dis
py> code = compile('x = 1; y = 0; print x and y', '', 'exec')
py> dis.dis(code)
  1           0 LOAD_CONST               0 (1)
              3 STORE_NAME               0 (x)
              6 LOAD_CONST               1 (0)
              9 STORE_NAME               1 (y)
             12 LOAD_NAME                0 (x)
             15 JUMP_IF_FALSE_OR_POP    21
             18 LOAD_NAME                1 (y)
        >>   21 PRINT_ITEM
             22 PRINT_NEWLINE
             23 LOAD_CONST               2 (None)
             26 RETURN_VALUE

If you run this under another implementation of Python, such as WPython, or
even a different version of CPython, you may get completely different byte
code. But the *semantics* of what it does must be identical, otherwise it
isn't Python.

> Beyond that, things get a little fuzzy, and it occurs to me
> that this fuzziness is the result of my looking at the expression x and
> y knowing full well what each token means and what I want done with
> them, whereas the interpreter won't know these things until it can parse
> the character stream and sort the tokens into some recognizable (and
> syntactically correct) order. 

No. Python will not sort the tokens into syntactically correct order. It
will only take them in the order you provide them. If you write code in the
wrong syntax:

x y <=  # I prefer to write reverse Polish notation

you will get a syntax error, Python won't rearrange the tokens into the
correct order x <= y.

> As I look at it, the expression x and y
> has two atoms, namely x and y.  x and y are also primaries, and they
> represent the most tightly bound parts of this expression (meaning they
> bind more tightly to their underlying objects than to the and operator).
>   Incidentally, how does Python figure out that the x and y in this
> expression refer to the x and y I previously bound to integer values?

Ah, now you're finally asking the right questions! This is the important
part: Python's execution model. All that stuff about parsing source code
and generating ASTs and byte-code is just the mechanics that are needed to
make it work. In principle, you could simulate a Python interpreter in your
head, or build a CPU that executed Python code directly, and avoid
everything except the most simple parser. The behaviour of the interpreter
will be the same, regardless of how much work it does.

In this case, Python does a runtime lookup in the current namespace for
names "x" and "y", retrieve the appropriate values, or raise a NameError if
they don't exist. In pseudo-code:

get the current namespace
if 'x' is not in the namespace: raise NameError
if 'y' is not in the namespace: raise NameError
get the value of x in the current namespace
get the value of y in the current namespace
perform the "and" operator on those two values

(more or less).

Python's actual execution model is based on a stack, like Forth, but really,
unless you're trying to read byte code, you will never need to know this.
It's irrelevant to the day-to-day understanding of Python code.

> I know there's a symbol table in each execution frame.  How does Python
> know to go to that table and check for x and y?

Because some person, many years ago, programmed the Python compiler to do
that.

> The and token represents
> an operator, a boolean operator to be specific.  As I look at the
> grammar for and_test in section 5.10 of the documentation, it would
> appear that the and_test resolves via not_test's definition to two
> comparisons, which in turn resolve to or_expr, and then via a series of
> binary bitwise definitions to shift_expr, then to a_expr, then to
> m_expr, then to u_expr, to power, and then primary, and then to atom,
> which lands us finally at non-terminal identifiers (i.e. x and y
> themselves). Questions:  In working through these steps, what I have
> actually demonstrated?  Is this how Python deconstructs an and statement
> with two operands?

Absolutely not. All you're doing is focusing on the formal grammar, instead
of either the Python virtual machine, or it's execution model.

> Do I take from the fact that the progression from
> and_test to identifier involved reference to bitwise operators that the
> boolean testing of x and y involves a bitwise comparison of x and y?

Absolutely not.

What you actually need to know is much more high-level than this. When
Python evaluates "x and y", it runs pseudo-code something like this:

* if x is not a true-like value, return x
* return y

This is short-cut semantics of boolean "and", with truthy/falsey values
instead of just a pair of True/False values.

You won't learn this from the grammar, because it isn't part of the grammar.
It is part of the execution model of the language.

How does Python determine whether x is "true-like"? It asks x, "are you
true?". If x has a __bool__ method, it calls x.__bool__(). Otherwise, if x
has a __len__ method, it calls x.__len__() and compares the result to zero.
Otherwise, it (somewhat arbitrarily, but reasonably) declares that since x
is an object, it is something rather than nothing and therefore it must be
true-ish.

I hope this helps, and expect it will raise as many questions as it answers!

-- 
Steven