Other notes

Fri Jan 7 16:31:25 EST 2005

On Fri, 07 Jan 2005 06:04:01 GMT, Andrew Dalke <dalke at dalkescientific.com> wrote:

>Bengt Richter:
>> But it does look ahead to recognize += (i.e., it doesn't generate two
>> successive also-legal tokens of '+' and '=')
>> so it seems it should be a simple fix.
>
>But that works precisely because of the greedy nature of tokenization.
So what happens if you apply greediness to a grammar that has both . and ..
as legal tokens? That's the point I was trying to make. The current grammar
unfortunately IMHO tries to tokenize floating point numbers, which creates a problem
for both numbers and (if you don't isolate it with surrounding spaces) the .. token.

There would UIAM be no problem recognizing 1 .. 2 but 1..2 has the problem that
the current tokenizer recognizes 1. as number format. Otherwise the greediness would
work to solve the 1..2 "problem."

IMHO it is a mistake to form floating point at the tokenizer level, and a similar mistake
follows at the AST level in using platform-specific floating point constants (doubles) to
represent what really should preserve full representational accuracy to enable translation
to other potential native formats e.g. if cross compiling. Native floating point should IMO
not be formed until the platform to which it is supposed to be native is identified.

IOW, I think there is a fix: keep tokenizing greedily and tokenize floating point as
a sequence of integers and operators, and let <integer><dot><integer> be translated by
the compiler to floating point, and <integer><dotdot><integer> be translated to the
appropriate generator expression implementation.

>Given "a+=2" the longest token it finds first is "a" because "a+"
>is not a valid token.  The next token is "+=".  It isn't just "+"
>because "+=" is valid.  And the last token is "2".
>
Exactly. Or I am missing something? (which does happen ;-)

>Compare to "a+ =2".  In this case the tokens are "a", "+", "=", "2"
>and the result is a syntax error.
Sure.
>
>>  >>> for t in tokenize.generate_tokens(StringIO.StringIO('a=b+c; a+=2; x..y').readline):print t
>>  ...
>
>This reinforces what I'm saying, no?  Otherwise I don't understand
>your reason for showing it.
It does reinforce your saying the matching is greedy, yes. But that led me to
think that .. could be recognized without a problem, given a grammar fix.
>
>>  (51, '+=', (1, 8), (1, 10), 'a=b+c; a+=2; x..y')
>
>As I said, the "+=" is found as a single token, and not as two
>tokens merged into __iadd__ by the parser.
No argument.
>
>After some thought I realized that a short explanation may be helpful.
>There are two stages in parsing a data file, at least in the standard
>CS way of viewing things.  First, tokenize the input.  This turns
>characters into words.  Second, parse the words into a structure.
>The result is a parse tree.
>
>Both steps can do a sort of look-ahead.  Tokenizers usually only look
>ahead one character.  These are almost invariably based on regular
>expressions.  There are many different parsing algorithms, with
>different tradeoffs.  Python's is a LL(1) parser.  The (1) means it
>can look ahead one token to resolve ambiguities in a language.
>(The LL is part of a classification scheme which summarizes how
>the algorithm works.)
>
>Consider if 1..3 were to be legal syntax.  Then the tokenizer
>would need to note the ambiguity that the first token could be
>a "1." or a "1".  If "1." then then next token could be a "."
>or a ".3".  In fact, here is the full list of possible choices
>
>  <1.> <.> <3>    same as getattr(1., 3)
>  <1> <.> <.> 3   not legal syntax
>  <1.> <.3>       not legal syntax
>  <1> <..> <3>    legal with the proposed syntax.
>
Right, but a grammar fix to handle floating point "properly" (IMHO ;-)
would resolve that. Only the last would be legal at the tokenizer stage.

>Some parsers can handle this ambiguity, but Python's
>deliberately does not.  Why?  Because people also find
I'm not sure what you mean. If it has a policy of greedy matching,
that does handle some ambiguities in a particular way.

>it tricky to resolve ambiguity (hence problems with
>precedence rules).  After all, should 1..2 be interpreted
>as 1. . 2 or as 1 .. 2?  What about 1...2?  (Is it 1. .. 2,
                 ^^^^^^[1]                          ^^^^^^^[2]
[1] plainly (given my grammar changes ;-)
[2] as plainly not, since <1> would not greedily accept '.' to make <1.>

>1 .. .2 or 1. . .2 ?)
 ^^^^^^[3]  ^^^^^^^[4]
[3] no, because greed would recognize an ellipsis <...>
[4] ditto. Greedy tokenization would produce <1> <...> <2> for the compiler.

Regards,
Bengt Richter