OT: Ultimate Language Syntax Cleanness Comparison

Sat Feb 8 12:29:47 EST 2003

Jim Richardson wrote:
> On 7 Feb 2003 19:55:35 -0800,
>  Jeremy Fincher <tweedgeezer at hotmail.com> wrote:
> > holger krekel <pyth at devel.trillke.net> wrote in message news:<mailman.1044658940.11235.python-list at python.org>...
> >> I was actually quite surprised to find out (together with a perl-friend)
> >> that there is no easy way to parse perl. All the methods involve 
> >> evaluating/executing it at the same time.  Cool, isn't it.
> > 
> > That's not true.  Perl *is* compiled to a bytecode format (what do you
> > think all the jazz about Parrot is for?)  There's no easy way to lex
> > Perl separate from parsing it.  Lexing and parsing Perl code is one in
> > the same.  Evaluating it is entirely separate.
> > 
> > Jeremy
> 
> 
> *raises hand in ignorance*
> 
> Can someone explain the differences between them? that is, evaluating,
> parsing and lexing? they seem pretty synonymic (ick!) to me, so I am
> missing something. What?

Sure.  It goes like this with Python (and often with other languages):

    lexing/tokenizing -> parsing -> compiling -> evaluating/executing

Let's go through it step by step.  Let's assume we have the string:

s="""if 1:
            print "hello"
"""

First we lex it into 'tokens':

>>> import tokenize
>>> tokenize.tokenize(['',s].pop)
1,0-1,4:        INDENT  '    '
1,4-1,6:        NAME    'if'
1,7-1,8:        NUMBER  '1'
1,8-1,9:        OP      ':'
1,9-1,10:       NEWLINE '\n'
1,22-1,27:      NAME    'print'
1,28-1,35:      STRING  '"hello"'
1,35-1,36:      NEWLINE '\n'
2,0-2,0:        DEDENT  ''
2,0-2,0:        ENDMARKER       ''

So now you have the 'tokens', and the parser doesn't have to care
about whitespace, among other things.  Note that the tokenizer
emits 'INDENT' and 'DEDENT' tokens because indentation is
significant in Python.  But the details are abstracted away.
(The ['',s].pop trick hands the whole string over as one 'line',
which is why almost every position above reports row 1.)
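
For anyone following along on a current Python 3, here is a minimal
sketch of the same tokenizing step.  It is an assumption-laden
translation, not what was run above: print is a function there, so the
sample string differs slightly, and the names `source`, `tokens` and
`token_names` are made up for illustration.

```python
import io
import tokenize

# Same little statement as in the post, written with Python 3's print().
source = 'if 1:\n    print("hello")\n'

# generate_tokens() wants a readline-style callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # tok.type is an integer code; tok_name maps it to a readable name.
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Collect just the token names so the INDENT/DEDENT pair is easy to spot.
token_names = [tokenize.tok_name[tok.type] for tok in tokens]
```

The listing should again contain an INDENT before the print call and a
DEDENT near the end, which is how the block structure reaches the parser.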

Next the parser:

>>> import compiler
>>> compiler.parse(s)
Module(None, 
       Stmt([If([(Const(1), Stmt([Printnl([Const('hello')],
           None)]))], None)]))   # slightly reformatted
>>>

This already reflects the *syntax* of our little if-statement.
The compiler takes this syntax tree and compiles it into bytecode:

>>> compiler.compile(s, '', 'exec')
<code object <module> at 0x827efb0, file "", line 1>
>>> compiler.compile(s, '', 'exec').co_code
'\x7f\x00\x00\x7f\x01\x00d\x01\x00o\x0c\x00\x01\x7f\x02\x00d\x02\x00GHn\x01\x00\x01d\x00\x00S'
>>>

And here you see the actual bytecode that is executed by
the Python interpreter (aka VM).  You can get a human-readable
version with the dis module:

>>> import dis
>>> dis.dis(compiler.compile(s, '', 'exec'))
          0 SET_LINENO               0

          3 SET_LINENO               1
          6 LOAD_CONST               1 (1)
          9 JUMP_IF_FALSE           12 (to 24)
         12 POP_TOP

         13 SET_LINENO               2
         16 LOAD_CONST               2 ('hello')
         19 PRINT_ITEM
         20 PRINT_NEWLINE
         21 JUMP_FORWARD             1 (to 25)
    >>   24 POP_TOP
    >>   25 LOAD_CONST               0 (None)
         28 RETURN_VALUE
>>>
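
The remaining steps (parse, compile, execute) can be sketched the same
way on a current Python 3.  This assumes the `ast` module and the
built-in compile() as stand-ins for the `compiler` package used above,
which no longer exists there; the opcodes it prints will differ from
the listing above, and the variable names are only illustrative.

```python
import ast
import dis
import io
from contextlib import redirect_stdout

source = 'if 1:\n    print("hello")\n'

# Parsing: text -> abstract syntax tree.  Nothing is executed here.
tree = ast.parse(source)
print(ast.dump(tree))

# Compiling: syntax tree -> code object holding the bytecode.
code = compile(tree, "<example>", "exec")
print(type(code.co_code))  # the raw bytecode is stored as bytes

# Human-readable listing of that bytecode.
dis.dis(code)

# Executing: only now does the print() call actually run.
# The output is captured so it can be inspected afterwards.
captured = io.StringIO()
with redirect_stdout(captured):
    exec(code, {})
output = captured.getvalue()
print(output, end="")
```

Note that the source text can be parsed and compiled without ever being
run; executing the code object is a separate, final step.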

Does that clarify a bit what the steps

    tokenizing -> parsing -> compiling -> executing

mean?

    holger