[Python-ideas] Idea: The DSL operator.

Wed May 30 03:22:01 CEST 2007

This is in response to the various suggestions for embedding SQL
syntax in Python and various other suggestions made over the years.

Note that this suggestion is only a starting point for discussion, I
don't have any illusion that the idea as presented here would have
strong support. I'm fully aware that the actual proposal given below
looks "butt-ugly", but maybe someone can invent something better.

Background: Python is frequently used as a foundation for Domain
Specific Languages (DSLs). Examples include SQLObject, SCons, and many
others. These are all examples of a mini-language which is constructed
on top of the Python syntax, using operator overloading. The DSL is
often embedded within Python code (as in the case of SQLObject), and
can be invoked programmatically from regular (non-DSL) Python code.

However, there are some limitations on what can be expressed. For
example, in current interpreters, it is not possible to override the
'and' and 'or' operator. While there currently is a PEP that proposes
adding this capability, there is some disagreement as to how much this
would impact the efficiency of the non-override case. The reason is
because in the non-override case, the compiler is able to take
advantage of the "shortcut" properties of these operators; If the
operators can be overridden (which is a post-compiler operation), then
the compiler would no longer be able to make these optimization
assumptions.

A similar limitation in DSLs is the treatment of unbound identifiers.
Identifiers which have already been bound to objects are not a
problem, since you can simply override the various operators of those
objects. However, for identifiers which have not yet been assigned to
a variable, a number of ugly hacks are required. Typically what is
done is to have a set of "special" objects which act as placeholders
for unbound variable names, for example _x, _y, _z, and so on.
Unfortunately, this requires the person writing the DSL expression to
pre-declare unbound variables, which is exactly the opposite of how
regular Python works. Another method that is sometimes used is to
declare unbound variables as attributes of some general "unbound
variable" object, so you can say something like Query.x, Query.y, and
so forth. But again, this is rather wordy and robs the DSL of its
ability to concisely express a particular semantic intent, which is
the whole point of DSLs in the first place.

Another limitation of Python's ability to represent DSLs is that you
cannot override the assignment operator.

My idea is to alleviate these problems by giving DSLs some help at the
parsing level. By informing the parser that a given expression is a
DSL, we can relax some of the normal assumptions of the Python syntax,
and instead require the DSL interpreter to handle them, which allows
the DSL to have greater flexibility.

Two things we want to avoid: First, we want to avoid "programmable
syntax", meaning that we cannot define DSL-specific semantics at parse
time. The reason for this is that DSLs are generally defined in their
own specific imported modules, and the Python parser has no knowledge
of imports, and therefore has no access to the DSL's syntax rules.

Secondly, we want to avoid conflicts between multiple DSLs. So for
example, if I am using two different DSLs in the same module, we can't
have two different meanings for "and", for example.

The solution to both of these concerns is to say that the parser's
support for DSLs is completely generic - in other words, we can tell
the parser that a given expression is a DSL, but we can tell it
nothing specific about that *particular* DSL. Instead, it is up to the
DSL interpreter - which operates at runtime, not parse time - to take
up where the parser left off.

This is similar to PJE's suggestion of a way to "quote" Python syntax,
returning an AST. In such a scheme, the Python parser doesn't
completely compile the quoted expression to Python byte code, but
instead produces an AST (or AST-like data structure which might not
exactly be an AST), which is then interpreted by the DSL.

So far this all seems reasonable enough to my ear :) Next is the part
where I go mad. (As my friend Sara said to me recently, "I'm not a mad
scientist - I'm a mad natural philosopher.")

For purposes of dicussion I am going to introduce the token '??' (two
consecutive question marks) as the "DSL operator", in other words a
signal to the parser that a given expression is a DSL. I'm sure that
there could be a long and intense bikeshed discussion debating the
specific operator syntax, but for now I'll choose ?? since it's
unambiguously parsable.

The '??' operator can be used in one of three ways. First, you can
declare an entire expression as a DSL:

    a = ??(c + d)

The parser will take all of the tokens inside the parenthesized group
and return them as an AST, it will not attempt to generate code to
evaluate any of the operators within the group.

There's an extra bit of syntactic sugar: If the DSL expression is the
only argument to a function, you can combine the function call and the
DSL quote:

   func??(c+d)

Meaning "pass the AST for 'c + d' to the function 'func'". For most
DSLs, the 'func' will actually be the name of the DSL, so the typical
use pattern will be 'name_of_dsl??(expr)'.

The second use is to quote an identifier:

    a = ??c + d

Now, you would think that this would quote the variable 'c' and
attempt to add it to 'd', but this is not the way it works. Instead,
the ?? operator invokes a kind of operator overloading that happens at
the parser level, meaning that any attempt to create an expression
that operates on a quoted variable also gets quoted. This only affects
binary and unary operators, not functions calls, array references or
assignment. So in this case, the entire expression "??c + d" gets
quoted.

The third way a DSL operator can be used is to quote an operator:

    a = c ??= d

Again, this allows us to generate the AST "c = d" and assign it to the
variable "a".

(I recognize that these last two syntactical variations would make the
interpretation of regular Python ASTs significantly more complex,
which is a serious drawback. Part of the motivation for both of these
variations is to deal with the 'and', 'or' and '=' problem, by
overriding the Python compiler's normal treatment of these operators.
However, it may be that the first form alone is sufficient, which
would be relatively easy to handle in the Python compiler I think -
you would simply have the AST.c module look for a DSL operator, and
don't call the normal AST processing steps for arguments to that
operator.)

Some examples of DSLs using the operator:

An SQL query as a DSL:

   query = SQL??(name == "John" and age > 25)
   result_iter = query.execute()

Syntax for an algebraic solver:

   @arity??(x + x)
   def reduce(x):
       return 2 * x

   @arity??(x + 0)
   def reduce(x):
       return x

   @arity??(x * 0)
   def reduce(x):
       return 0

   @arity??(x * 1)
   def reduce(x):
       return x

   @arity??(x * a + y * a)
   def reduce(x):
       return (x+y)*a

   # (and so on for all the other reduce() rules)

   print(reduce??((a+b*3) + (a+b*3)))
   >>> a*2+b*6

(In the above, we define a function 'reduce' that is similar to a
generic function, except that the dispatch logic is based on the form
of the AST being passed as an argument, using a one-way unification
algorithm to do argument pattern matching.)

I realize that there are some aspects of this proposal that may seem
silly, and that's true, and the only reason I'm even suggesting this
at all is that much of what I'm writing here is already been suggested
or at least implied by previous discussions.

Flame on!

-- 
-- Talin