[Python-ideas] Built-in parsing library

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Wed Apr 10 00:13:35 EDT 2019


Barry Scott writes:

 > I'm not so sure that a "universal parsing library" is possible for
 > the stdlib.

That shouldn't be our goal.  (And I don't think Nam is wedded to that
expression of the goal.)

 > I think one way you could find out what the requirements are is to
 > refactor at least 2 of the existing stdlib modules that you have
 > identified as needing a better parser.

I think this is a really good idea.  I'll be sprinting on Mailman at
PyCon, but if Nam and other proponents have time around PyCon (and
haven't done it already :-) I'll be able to make time then.  Feel free
to ping me off-list.  (Meeting at PyCon would be a bonus, but IRC or
SNS messaging/whiteboarding works for me too if other interested folks
can't be there.)

 > Did you find that you could use the same parser code for both?

I think it highly likely that "enough" protocols and "little languages"
that are normally written by machines (or skilled programmers) can be
handled by "Dragon Book"[1] parsers to make it worth adding some
parsing library to the stdlib.  Of course, more general (but still
efficient) options have been developed since I last shaved a yacc, but
that's not the point.  Developers who have special needs (extremely
efficient parsing of a relatively simple grammar, more general
grammars) or simply want to continue using a different module that
they've been requiring from the Cheese Shop since it was called "the
Cheese Shop"[2] can (and should) do that.  The point of the stdlib is to
provide standard batteries that serve in common situations going forward.

I've been using regexps since 1980, and am perfectly comfortable with
rather complex expressions.  E.g., I've written more or less general
implementations of RFC 3986 and its predecessor RFC 2396, which is one
of the examples Nam has tried.  Nevertheless, there are some tricky
aspects (for example, I did *not* try to implement 3986 in one
expression -- as 3986 says:

    These restrictions result in five different ABNF rules for a path
    (Section 3.3), only one of which will match any given URI
    reference.

so I used multiple, mutually exclusive regexps for the different
productions).  There is no question in my mind that the ABNF is easier
to read.  Implementing a set of regexps from the ABNF is easier than
reconstructing the ABNF from the regexps.  That's *my* rationale for
including a parsing module in the stdlib: making common parsing tasks
more reliable in implementation and more maintainable.
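
As an illustration of the "one regexp per production" approach, here is a
minimal sketch of the five RFC 3986 path rules transcribed into Python's re
module.  It is not the implementation referred to above; the names
(PATH_RULES, matching_rules, and the pattern constants) are made up for this
example.  Taken as bare patterns, several of them accept the same string; in
RFC 3986 the surrounding context (for instance, whether an authority is
present) determines which rule applies.

```python
import re

# Character classes transcribed from the RFC 3986 ABNF.
# All names below are illustrative, not from any existing module.
UNRESERVED = r"[A-Za-z0-9\-._~]"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
SUB_DELIMS = r"[!$&'()*+,;=]"
PCHAR = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS}|[:@])"

SEGMENT = rf"{PCHAR}*"          # segment       = *pchar
SEGMENT_NZ = rf"{PCHAR}+"       # segment-nz    = 1*pchar
# segment-nz-nc: non-empty segment with no colon
SEGMENT_NZ_NC = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS}|@)+"

# One compiled pattern per ABNF path rule.
PATH_RULES = {
    "path-abempty": re.compile(rf"(?:/{SEGMENT})*"),
    "path-absolute": re.compile(rf"/(?:{SEGMENT_NZ}(?:/{SEGMENT})*)?"),
    "path-noscheme": re.compile(rf"{SEGMENT_NZ_NC}(?:/{SEGMENT})*"),
    "path-rootless": re.compile(rf"{SEGMENT_NZ}(?:/{SEGMENT})*"),
    "path-empty": re.compile(r""),
}


def matching_rules(path):
    """Return the names of the path rules whose pattern matches all of `path`."""
    return [name for name, rx in PATH_RULES.items() if rx.fullmatch(path)]


if __name__ == "__main__":
    for candidate in ["", "/a/b", "//a", "a:b/c", "a/b"]:
        print(repr(candidate), matching_rules(candidate))
```

Each pattern can be checked line by line against the corresponding ABNF
rule, but going the other way (recovering the ABNF from PATH_RULES alone)
takes noticeably more effort, which is the maintainability point made above.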

To me, the real question is, "Suppose we add a general parsing library
to the stdlib, and refactor some modules to use it.  (1) Will this
'magically' fix some bugs/RFEs?  (Not essential, but would be a nice
bonus.)  (2) Will the denizens of python-ideas and python-dev find
such refactored modules readable and more maintainable than a plethora
of ad hoc recursive descent parsers?"

Obviously people who haven't studied parsers will have to project to a
future self that has become used to reading grammar descriptions, but
I think folks around here are used to that kind of projection.  This
would be a good test.

Footnotes: 
[1]  "Do I date myself?  Very well then, I date myself."

[2]  See [1].
