invalid-token syntax hook (was Re: Hack request: rational numbers)

Tim Peters tim.one at home.com
Mon Jan 29 05:36:11 EST 2001


[Tim]
> That said, I believe your only hope of getting new literal syntax is to
> embed it in a general proposal. ...

[Alex Martelli]
> What I was thinking, too.  So what about the "invalid-token syntax
> error hook" idea -- whenever the compiler is about to raise a
> SyntaxError because of an invalid token ('3r','@@x',etc), if
> sys.invalidTokenHook (or something of that ilk) is set to some
> callable, that callable is invoked, passed the invalid-token string,
> and given a chance to either return None (confirming that, yes, it
> IS a syntax error) or a tuple (possibly empty or singleton) of tokens
> to be used instead.

I think there are technical and philosophical reasons why that won't fly.
Primarily, Guido would be appalled by any scheme that can take an arbitrary
sequence of characters and let people call that "Python".  Technically,
schemes that rely on compiler internals probably can't be extended to all
the tools I mentioned before (e.g., *lots* of tools "parse" Python via
half-assed regexp tricks now, but with the right half of the ass needed to
get the pieces they're looking for).

The compiler internals are dicey too.  For example, look closely at "3r".
That's not an invalid token today, it's a sequence of two valid tokens.
Python doesn't complain about 3r at the lexer level, it complains after
failing to find any *grammar* production matching the "3" "r" token
sequence.  Note that

>>> 3or 2
3
>>>

is legit Python today (just making concrete that
    digit letter+
is not an error at the tokenization level).  Similarly, "@@x" is a sequence
of two illegal characters followed by a legit token.  If Python got hooked,
it would complain about the first "@" and that's that -- it would not have
even gotten around to reading the following "@x" by then.
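The lexer-vs-grammar split above is easy to inspect with the stdlib tokenize module.  (Caveat: newer CPythons warn about or reject the glued "3or" form, so this sketch uses the spaced "3 or 2"; the point is the same -- "or" comes out as a plain NAME token, and keywords are only recognized later, by the grammar.)

```python
import io
import tokenize

# Tokenize "3 or 2" and collect the token-type names.
toks = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO("3 or 2").readline)
]

# The lexer sees NUMBER NAME NUMBER; it has no opinion about
# whether that sequence means anything -- the grammar decides that.
print(toks[:3])  # ['NUMBER', 'NAME', 'NUMBER']
```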

> Not sure how (if?) a module specifies what hook it wants to
> be set during its own compilation -- perhaps it can't, in which
> case a module whose source contains syntactically invalid
> tokens can only be compiled after 'something else' has set
> the (global?) hook 'appropriately'.
>
> No, doesn't look like a necessarily great idea to me, either.

I expect that while the technical problems would prove solvable given
extraordinary effort (i.e., fat chance <wink>), the philosophical objections
would be fatal anyway.

> It allows a very modest amount of syntax extension (token
> level only -- really nothing more than the sugariest kind)
> and modules using it become not-understandable unless
> what hook they use is known/documented.  At least it does
> not allow _altering_ syntax in general (as only currently
> invalid tokens would ever be passed to the hook), but,
> it _would_ make it harder or impossible to extend this
> level of syntax in the future (a currently-invalid token could
> not become valid in the future without risking breakage
> of some existing hook).
>
> Consider it a strawman just to try and fathom if there IS
> some halfway-good general idea for extensibility...

We *might* get somewhere being even less ambitious.  Perhaps by extending
Python's grammar in a *fixed* way, so that every tool can learn to deal with
it via "cheap tricks", and no compile-time hooks are needed to decide
whether a piece of text *is* Python code.  The simplest thing I can think of
would be to allow any letter as a prefix to a string (instead of just the
[uUrR] allowed today), and ditto for a suffix on "a number" (instead of just
[lL] today).

But screw strings, since everyone who ever brings this up complains they
hate typing strings anyway (e.g., Fixed("3.45") and Rational("3/4") are
always rejected by people who want fixed-point decimal and rational types).
That leaves numbers, and

   rat = 3/4r

to divide int 3 by Rational 4, or

   my_base10_float = 3.45d

wouldn't cause much trouble at compile-time.  Presumably the earlier

>>> 3or 2

would become illegal (since it's now two adjacent numeric tokens), but I
don't think Guido would mind that (e.g., IDLE doesn't colorize it correctly
today, and I believe that's intentional).

The compiler could simply generate code to call a new function, say,

    sys.user_literal(basestring, tags)

so that

    rat = 3/4r

would compile "as if"

    rat = 3 / sys.user_literal("4", "r")

had been written, and

    my_base10_float = 3.45d

as if

    my_base10_float = sys.user_literal("3.45", "d")

Then the user would be responsible for plugging in a suitable user_literal
function at runtime, before any module using the feature got imported.
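No such hook ever made it into Python, but a user-supplied user_literal could be as simple as a dispatch on the suffix string.  A sketch (Fraction and Decimal standing in for the hypothetical Rational and fixed-point-decimal types):

```python
import sys
from decimal import Decimal
from fractions import Fraction

# Hypothetical hook: what a user might plug in as sys.user_literal.
# The compiler support that would *call* it never existed.
def user_literal(basestring, tags):
    if tags == "r":
        return Fraction(basestring)   # "4r" -> Fraction(4, 1)
    if tags == "d":
        return Decimal(basestring)    # "3.45d" -> Decimal('3.45')
    raise SyntaxError("unknown literal suffix %r" % (tags,))

sys.user_literal = user_literal       # install before importing users

# What "rat = 3/4r" would compile to under the proposal:
rat = 3 / sys.user_literal("4", "r")
print(rat)  # 3/4
```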

Damned cute:  If I decorated all my floating-point literals with a "d"
suffix, then I could run an algorithm once with a hook that converted the
literals to builtin floats, and then change the hook to convert to an
emulation of floats with twice the precision and run it again.  That's a
wonderful way to get a quick feel for whether native float precision is
suffering catastrophic numeric errors.  Or run it again changing the hook to
convert them to rationals, and get an exact result (if the operations were
just simple arithmetic).  Heck, the body of a function could dynamically set
the hook based on the type of an argument, giving the effect of truly
polymorphic literals.  Could be very useful, to an ever decreasing audience
<0.5 wink>.
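The rerun-under-a-different-hook trick can be faked today by passing the converter in explicitly.  Here the parameter lit plays the role of the hook applied to every "d"-decorated literal, and the same algorithm runs under native floats, decimals, and exact rationals:

```python
from decimal import Decimal
from fractions import Fraction

def algorithm(lit):
    # lit stands in for the user_literal hook on "d" literals
    total = lit("0")
    for _ in range(10):
        total += lit("0.1")
    return total

print(algorithm(float))     # 0.9999999999999999 -- native floats drift
print(algorithm(Decimal))   # 1.0 -- exact in decimal arithmetic
print(algorithm(Fraction))  # 1 -- exact rational result
```

Comparing the first two lines of output is exactly the cheap "is native precision hurting me?" check described above.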

That's the best I can do in 15 minutes, but I'm afraid that's all the time
I've got now ...

do-the-simplest-thing-that-could-possibly-get-rejected<wink>-ly y'rs  - tim





More information about the Python-list mailing list