PEP 269

Jonathan Riehl jriehl at spaceship.com
Thu Sep 13 15:49:32 EDT 2001


Howdy all,
	I'm afraid Martin's attention to the PEP list has outed me
before I was able to post about this myself.  Anyway, for those
interested, I wrote a PEP proposing the exposure of pgen to the Python
interpreter.  You may view it at:

http://python.sourceforge.net/peps/pep-0269.html

	I am looking for comments on this PEP, and below I address some
interesting issues raised by Martin.  Furthermore, I already have a
partially functioning reference implementation, and should be pestered to
make it available shortly.

Thanks,
-Jon

On Tue, 11 Sep 2001, Martin von Loewis wrote:

> Hi Jonathan,
>
> With interest I noticed your proposal to include Pgen into the
> standard library. I'm not sure about the scope of the proposed change:
> Do you view pgen as a candidate for a general-purpose parser toolkit,
> or do you "just" contemplate using that for variations of the Python
> grammar?

I am thinking of going for the low-hanging fruit first (a Python-centric
pgen module), and then adding more functionality in later releases of
Python (see below).

> If the former, I think there should be a strategy already how
> to expose pgen to the application; the proposed API seems
> inappropriate. In particular:
>
> - how would I integrate an alternative tokenizer?
> - how could I integrate semantic actions into the parse process,
>   instead of creating the canonical AST?

The change currently proposed is somewhat constrained by the Python 2.2
release schedule, and will initially only address building parsers that
use the Python tokenizer.  If the module misses the 2.2 release, I'd like
to make it more functional and provide the ability to override the Python
tokenizer.  I may also add methods to export all of the data found in the
DFA structure.
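
To make this concrete, here is a rough sketch of how the proposed
interface might be used once it exists (the pgen module is not yet
implemented; the function names follow the PEP draft, the exact argument
lists are still open, and "mylang.grammar" is a made-up file name):

    import pgen

    # Read a grammar written in the Grammar/Grammar metalanguage and
    # build parse tables (DFAs) from it.
    grammarAst = pgen.parseGrammarFile("mylang.grammar")
    parser = pgen.buildParser(grammarAst)

    # Parse source text against the new tables; for now the token
    # stream always comes from the standard Python tokenizer.
    ast = pgen.parseString("x = spam(1)\n", parser)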

I am unsure what integrating semantic actions into the parse process
buys us besides lower memory overhead.  In C/C++ such coupling is needed
because of the TYPEDEF/IDENTIFIER tokenization problem (the lexer cannot
tell whether a name is a type or an ordinary identifier without feedback
from the parser), but I don't see Python and future Python-like, LL(1),
languages needing such hacks.  Finally, I am inclined to enforce the
separation of the backend actions from the AST.  This allows the AST to
be used for a variety of purposes, rather than only those intended by the
initial parser developer.
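
To illustrate the decoupling I have in mind, here is a purely
hypothetical sketch.  It assumes parser-module-style ASTs (nested
sequences whose first element is a symbol or token number) and runs
backend actions in a separate walk after parsing is finished:

    def walk(node, actions):
        # Terminals look like (token, text); hand back the text.
        if type(node[1]) == type(""):
            return node[1]
        # Nonterminals look like (symbol, child, child, ...); reduce
        # the children first, then apply whatever action is
        # registered for this symbol.
        children = [walk(kid, actions) for kid in node[1:]]
        handler = actions.get(node[0])
        if handler:
            return handler(children)
        return children

A pretty-printer, a compiler backend, and a lint pass could then each
register their own action tables against the very same tree.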

> Of course, these questions are less interesting if the scope is to
> parse Python: in that case, Python tokenization is fine, and everybody
> is used to getting the Python AST.

An interesting note about this: since the nonterminal integer values
are generated by pgen, pgen ASTs are not currently compatible with the
parser module's ASTs.  Perhaps such unification can be slated for future
work.  (I know Fred left room in the parser AST datatype to identify the
grammar that generated an AST using an integer value, but relying on this
would be questionable in a "rapid parser development" environment.)
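
As a purely hypothetical sketch of what bridging the two numberings
could look like (this assumes the symbolToStringMap function from the
PEP draft and the `parser` tables built earlier; none of it exists yet),
nonterminals could be matched up by name:

    import symbol

    # Map pgen's generated nonterminal numbers to their names, then
    # match names against the stock symbol module where they coincide.
    toName = pgen.symbolToStringMap(parser)
    renumber = {}
    for num, name in toName.items():
        if hasattr(symbol, name):
            renumber[num] = getattr(symbol, name)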

> On the specific API, I think you should drop the File functions
> (parseGrammarFile, parseFile). Perhaps you can also drop the String
> functions, and provide only functions that expect file-like objects.

I am open to further discussion on this, but I would note that filename
information is used (and useful) when reporting syntax errors.  I think
the "streaming" approach to parsing is another holdover from the days
when memory constraints ruled (much like binding semantics to the parser
itself).
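
For example (a sketch only; whether parseFile would raise the standard
SyntaxError is itself an open detail), keeping file names in the
interface lets a caller report errors the same way the byte-code
compiler does:

    try:
        ast = pgen.parseFile("spam.py", parser)
    except SyntaxError, err:
        # A real file name lets the message point at the offending
        # file and line, just as the compiler's own errors do.
        print "%s:%d: %s" % (err.filename, err.lineno, err.msg)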

> On the naming of the API functions: I propose to use an underscore
> style instead of the mixedCaps style, or perhaps to leave out any
> structure (parsegrammar, buildparser, parse, symbol2string,
> string2symbolmap). That would be more in line with the parser module.

I would like to hear more about this from the Pythonati.  I am currently
following the naming conventions I use at work, which of course is most
natural for me at home. :)

>
> Regards,
> Martin




