[Python-ideas] Hooking between lexer and parser

Neil Girdhar mistersheik at gmail.com
Mon Jun 8 04:47:16 CEST 2015


On Sun, Jun 7, 2015 at 10:42 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 8 June 2015 at 12:23, Neil Girdhar <mistersheik at gmail.com> wrote:
> >
> >
> > On Sun, Jun 7, 2015 at 1:59 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> >>
> >> On 7 June 2015 at 08:52, Andrew Barnert via Python-ideas
> >> <python-ideas at python.org> wrote:
> >> > Also, if we got my change, I could write code that cleanly hooks
> >> > parsing in
> >> > 3.6+, but uses the tokenize/untokenize hack for 2.7 and 3.5, so people
> >> > can
> >> > at least use it, and all of the relevant and complicated code would be
> >> > shared between the two versions. With your change, I'd have to write
> >> > code
> >> > that was completely different for 3.6+ than what I could backport,
> >> > meaning
> >> > I'd have to write, debug, and maintain two completely different
> >> > implementations. And again, for no benefit.
> >>
> >> I don't think I've said this explicitly yet, but I'm +1 on the idea of
> >> making it easier to "hack the token stream". As Andrew has noted, there
> >> are two reasons this is an interesting level to work at for certain
> >> kinds of modifications:
> >>
> >> 1. The standard Python tokeniser has already taken care of converting
> >> the byte stream into Unicode code points, and the code point stream
> >> into tokens (including replacing leading whitespace with the
> >> structural INDENT/DEDENT tokens)
> >
> >
> > I will explain in another message how to replace the indent and dedent
> > tokens so that the lexer loses most of its "magic" and becomes just like
> > the parser.
>
> I don't dispute that this *can* be done, but what would it let me do
> that I can't already do today? In addition, how will I be able to
> continue to do all the things that I can do today with the separate
> tokenisation step?
>
> *Adding* steps to the compilation toolchain is doable (one of the
> first things I was involved in as a CPython core developer was the
> introduction of the AST based parser in Python 2.5), but taking them
> *away* is much harder.
>
> You appear to have an idealised version of what a code generation
> toolchain "should" be, and would like to hammer CPython's code
> generation pipeline specifically into that mould. That's not the way
> this works - we don't change the code generator for the sake of it, we
> change it to solve specific problems with it.
>
> Introducing the AST layer solved a problem. Introducing an AST
> optimisation pass would solve a problem. Making the token stream
> easier to manipulate would solve a problem.
>
> Merging the lexer and the parser doesn't solve any problem that we have.
>

You're right.  And as usual, Nick, your analysis is spot on.  My main
concern is that no change should preclude the idealized way of parsing
the language.  Does adding token manipulation promise forwards
compatibility?  Will Python 3.9 have to expose the same kind of token
manipulator?  If not, then I'm +1 on token manipulation. :)

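To make sure we're talking about the same thing, here is roughly the
kind of token-stream hook I have in mind -- a minimal sketch built on
the Python 3 stdlib tokenize module.  The rename transform is just a
placeholder for illustration, not anyone's actual proposal:

    import io
    import tokenize

    def rewrite_tokens(source):
        # Tokenize, transform the linear token stream, then reassemble
        # the source text.  The "transform" here just renames one
        # identifier; a real hook would do something more interesting.
        result = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and tok.string == "old_name":
                # Same length as the original, so the recorded positions
                # stay valid when untokenize reconstructs the text.
                tok = tok._replace(string="new_name")
            result.append(tok)
        return tokenize.untokenize(result)

    print(rewrite_tokens("old_name = 1\nprint(old_name)\n"))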

>
> >> 2. You get to work with a linear stream of tokens, rather than a
> >> precomposed tree of AST nodes that you have to traverse and keep
> >> consistent
> >
> > The AST nodes would contain within them the linear stream of tokens
> > that you are free to work with.  The AST also encodes the structure
> > of the tokens, which can be very useful if only to debug how the
> > tokens are being parsed.  You might find yourself, when doing a more
> > complicated lexical transformation, trying to reverse engineer where
> > the parse tree nodes begin and end in the token stream.  This would
> > be a nightmare.  This is the main problem with trying to process the
> > token stream "blind" to the parse tree.
>
> Anything that cares about the structure to that degree shouldn't be
> manipulating the token stream - it should be working on the parse
> tree.
>
> >> If all you're wanting to do is token rewriting, or to push the token
> >> stream over a network connection in preference to pushing raw source
> >> code or fully compiled bytecode, a bit of refactoring of the existing
> >> tokeniser/compiler interface to be less file based and more iterable
> >> based could make that easier to work with.
> >
> > You can still do all that with the tokens included in the parse tree.
>
> Not as easily, because I have to navigate the parse tree even when I
> don't care about that structure, rather than being able to just look
> at the tokens in isolation.
>

I don't think the extra navigation would be much of a burden, and it
would prevent bugs by letting you confirm that the parse tree structure
is what you think it is.  It's a matter of intuition, I guess.

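The stdlib ast module doesn't carry the underlying token stream the way
I'm proposing, but the position information it already records gives a
rough feel for what knowing where each node begins buys you -- a small
sketch, purely for illustration:

    import ast

    source = "x = [1, 2]\nif x:\n    print(x)\n"
    for node in ast.walk(ast.parse(source)):
        # Each statement/expression node records where it starts, so
        # you can line tokens up against the structure instead of
        # reverse engineering the boundaries yourself.
        if hasattr(node, "lineno"):
            print(type(node).__name__, node.lineno, node.col_offset)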
>
> Regards,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>