Other notes

Andrew Dalke dalke at dalkescientific.com
Fri Jan 7 01:04:01 EST 2005


Bengt Richter:
> But it does look ahead to recognize += (i.e., it doesn't generate two
> successive also-legal tokens of '+' and '=')
> so it seems it should be a simple fix.

But that works precisely because of the greedy nature of tokenization.
Given "a+=2" the longest token it finds first is "a" because "a+"
is not a valid token.  The next token is "+=".  It isn't just "+"
because "+=" is valid.  And the last token is "2".

Compare to "a+ =2".  In this case the tokens are "a", "+", "=", "2"
and the result is a syntax error.
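
The two cases can be checked with the standard tokenize module. This is
a minimal sketch for a current Python 3; the helper name
significant_tokens is my own, and it simply drops the layout tokens
(NEWLINE, NL, ENDMARKER) so only the "words" remain.

```python
import io
import tokenize

# Token types that describe layout rather than content.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def significant_tokens(source):
    """Return the token strings for one line of source, minus layout tokens."""
    readline = io.StringIO(source + "\n").readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in LAYOUT]

print(significant_tokens("a+=2"))   # ['a', '+=', '2']
print(significant_tokens("a+ =2"))  # ['a', '+', '=', '2']
```

The tokenizer accepts both inputs. The second one only fails later,
when the parser sees "+" followed by "=".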

>  >>> for t in tokenize.generate_tokens(StringIO.StringIO('a=b+c; a+=2; x..y').readline):print t
>  ...

This reinforces what I'm saying, no?  Otherwise I don't understand
your reason for showing it.

>  (51, '+=', (1, 8), (1, 10), 'a=b+c; a+=2; x..y')

As I said, the "+=" is found as a single token, and not as two
tokens merged into __iadd__ by the parser.

After some thought I realized that a short explanation may be helpful.
There are two stages in parsing a data file, at least in the standard
CS way of viewing things.  First, tokenize the input.  This turns
characters into words.  Second, parse the words into a structure.
The result is a parse tree.
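
The second stage can be seen with the ast module of a current Python 3
(it did not exist in this form when this message was written, so treat
this as an illustration, not a description of the 2005 internals). The
parser receives "+=" as one token and builds an augmented-assignment
node. With "a+ =2" it receives "+" and "=" separately and gives up.

```python
import ast

# Stage two on well-formed input: one AugAssign node with an Add operator.
tree = ast.parse("a+=2")
stmt = tree.body[0]
print(type(stmt).__name__, type(stmt.op).__name__)  # AugAssign Add

# The same characters with a space: the tokens are "+" and "=",
# and no grammar rule accepts that sequence.
try:
    ast.parse("a+ =2")
    rejected = False
except SyntaxError:
    rejected = True
print("a+ =2 rejected:", rejected)
```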

Both steps can do a sort of look-ahead.  Tokenizers usually only look
ahead one character.  These are almost invariably based on regular
expressions.  There are many different parsing algorithms, with
different tradeoffs.  Python's is an LL(1) parser.  The (1) means it
can look ahead one token to resolve ambiguities in a language.
(The LL is part of a classification scheme which summarizes how
the algorithm works.)

Consider if 1..3 were to be legal syntax.  Then the tokenizer
would need to note the ambiguity that the first token could be
a "1." or a "1".  If "1." then the next token could be a "."
or a ".3".  In fact, here is the full list of possible choices

  <1.> <.> <3>    same as getattr(1., 3)
  <1> <.> <.> <3> not legal syntax
  <1> <.> <.3>    not legal syntax
  <1.> <.3>       not legal syntax
  <1> <..> <3>    legal with the proposed syntax.
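
The possible splits can be enumerated mechanically. The sketch below
assumes a toy token set of my own (ordinary numeric literals, ".", and
the proposed ".."), not Python's real token definitions, and tries
every way of cutting "1..3" into valid tokens. It finds five splits,
including <1> <.> <.3>.

```python
import re

# Hypothetical token set: numbers like 1, 1., .3, plus "." and the proposed "..".
TOKEN = re.compile(r"\d+\.\d*|\.\d+|\d+|\.\.|\.")

def segmentations(text, pos=0):
    """Yield every way to split text[pos:] into a sequence of valid tokens."""
    if pos == len(text):
        yield []
        return
    for end in range(pos + 1, len(text) + 1):
        piece = text[pos:end]
        if TOKEN.fullmatch(piece):
            for rest in segmentations(text, end):
                yield [piece] + rest

for split in segmentations("1..3"):
    print(" ".join("<%s>" % tok for tok in split))
```

A greedy tokenizer never does this search. It commits to the longest
token at each position and moves on.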

Some parsers can handle this ambiguity, but Python's
deliberately does not.  Why?  Because people also find
it tricky to resolve ambiguity (hence problems with
precedence rules).  After all, should 1..2 be interpreted
as 1. . 2 or as 1 .. 2?  What about 1...2?  (Is it 1. .. 2,
1 .. .2 or 1. . .2 ?)
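
For comparison, here is what the existing greedy rule does with those
inputs, using the standard tokenize module on a current Python 3. The
helper name greedy_split is my own. There is no ".." token today, so
the tokenizer just takes the longest number it can at each step.

```python
import io
import tokenize

# Token types that describe layout rather than content.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def greedy_split(source):
    """Token strings that Python's own tokenizer produces for one line."""
    readline = io.StringIO(source + "\n").readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in LAYOUT]

for source in ("1..2", "1...2"):
    print(source, "->", greedy_split(source))
```

On the CPython versions I know of, "1..2" comes out as "1." followed by
".2", and "1...2" as "1.", ".", ".2". Neither sequence is accepted by
the parser.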


				Andrew
				dalke at dalkescientific.com



