[Python-Dev] PEP 498: Literal String Interpolation is ready for pronouncement

Eric V. Smith eric at trueblade.com
Sun Sep 6 02:45:07 CEST 2015


On 9/5/2015 7:12 PM, Nathaniel Smith wrote:
> On Sat, Sep 5, 2015 at 1:00 PM, Eric V. Smith <eric at trueblade.com> wrote:
>> On 9/5/2015 3:23 PM, Nathaniel Smith wrote:
>>> On Sep 5, 2015 11:32 AM, "Eric V. Smith" <eric at trueblade.com
>>> <mailto:eric at trueblade.com>> wrote:
>>>> Ignore the part about non-doubled '}'. The actual description is:
>>>>
>>>> To find the end of an expression, it looks for a '!', ':', or '}', not
>>>> inside of a string or (), [], or {}. There's a special case for '!=' so
>>>> the bang isn't seen as ending the expression.
>>>
>>> Sounds like you're reimplementing a lot of the lexer... I guess that's
>>> doable, but how confident are you that your definition of "inside a
>>> string" matches the original in all corner cases?
>>
>> Well, this is 35 lines of code (including comments), and it's much
>> simpler than a lexer (in the sense of "something that generates
>> tokens"). So I don't think I'm reimplementing a lot of the lexer.
>>
>> However, your point is valid: if I don't do the same thing the lexer
>> would do, I could either prematurely find the end of an expression, or
>> look too far. In either case, when I call ast.parse() I'll get a syntax
>> error, and/or I'll get an error when parsing/lexing the remainder of the
>> string.
>>
>> But it's not like I have to agree with the lexer: no larger error will
>> occur if I get it wrong. Everything is confined to a single f-string,
>> since I've already used the lexer to find the f-string in its entirety.
>> I only need to make sure the users understand how expressions are
>> extracted from f-strings.
>>
>> I did look at using the actual lexer (Parser/tokenizer.c) to do this,
>> but it would require a large amount of surgery. I think it's overkill
>> for this task.
>>
>> So far, I've tested it enough to have reasonable confidence that it's
>> correct. But the implementation could always be swapped out for an
>> improved version. I'm certainly open to that, if we find cases that the
>> simple scanner can't deal with.
>>
>>> In any case the abstract language definition part should be phrased in
>>> terms of the python lexer -- the expression ends when you encounter the
>>> first } *token* that is not nested inside () [] {} *tokens*, and then
>>> you can implement it however makes sense...
>>
>> I'm not sure that's an improvement on Guido's description when you're
>> trying to explain it to a user. But when the time comes to write the
>> documentation, we can discuss it then.
> 
> I'm not talking about end-user documentation, I'm talking about the
> formal specification, like in the Python Language Reference.

I think the formal specification can talk about scanning a string
looking for the end of an expression without discussing tokens, just as
https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
uses an EBNF-style notation.

The implementation is not tokenizing anything: it's just trying to find
the end of each expression so that the expression text can be passed to
ast.parse(). That task is significantly easier than tokenizing.
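
Here's a rough pure-Python sketch of that scanning rule (the function
name and details are just for illustration; this is not the actual C
implementation in the patch):

    # Illustrative sketch only: stop at '!', ':' or '}' that is not
    # nested inside (), [], {} or a string literal, treating '!=' specially.
    def find_expr_end(s, start=0):
        """Return the index just past the expression starting at `start`."""
        depth = 0        # nesting level of (), [], {}
        quote = None     # quote character if we're inside a string literal
        i = start
        while i < len(s):
            ch = s[i]
            if quote:
                if ch == '\\':
                    i += 2          # skip the escaped character
                    continue
                if ch == quote:
                    quote = None    # string literal closed
            elif ch in '\'"':
                quote = ch          # string literal opened
            elif ch in '([{':
                depth += 1
            elif ch in ')]}':
                if depth == 0 and ch == '}':
                    return i        # end of the expression
                depth -= 1
            elif depth == 0:
                if ch == ':':
                    return i        # start of the format spec
                if ch == '!' and s[i + 1:i + 2] != '=':
                    return i        # conversion like !r, but not '!='
            i += 1
        raise SyntaxError("unterminated expression in f-string")

For example, find_expr_end("d['key']!r:>10}") stops at the '!', while
find_expr_end("a != b}") correctly runs on to the closing '}'.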

> I'm pretty sure that just calling the tokenizer will be easier for
> Cython or PyPy than implementing a special purpose scanner :-)

I sincerely doubt that, but I'd be curious how they implemented
_string.formatter_parser(), which is extremely close in functionality.
Sadly, just not close enough to be reusable for this.

But I'm not going to argue about implementation when the PEP hasn't been
accepted yet. Especially not over code I won't be writing for Cython or PyPy!
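
(For reference: the Python-level interface to _string.formatter_parser()
is string.Formatter().parse(), which in CPython is just a thin wrapper
around it. It splits a format string into
(literal_text, field_name, format_spec, conversion) tuples:

    import string

    for parts in string.Formatter().parse("x = {x!r:>10}, y = {y}"):
        print(parts)
    # ('x = ', 'x', '>10', 'r')
    # (', y = ', 'y', '', None)

The field names there are plain identifiers plus attribute and index
references, not arbitrary expressions, which is why it can't be reused
directly here.)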

>>> (This is then the same rule that patsy uses to find the end of python
>>> expressions embedded inside patsy formula strings: patsy.readthedocs.org
>>> <http://patsy.readthedocs.org>)
>>
>> I don't see where patsy looks for expressions in parts of strings. Let
>> me know if I'm missing it.
> 
> Patsy parses strings like
> 
>    "np.sin(a + b) + c"
> 
> using a grammar that supports some basic arithmetic-like infix
> operations (+, *, parentheses, etc.), and in which the atoms are
> arbitrary Python expressions. So the above string is parsed into a
> patsy-AST that looks something like:
> 
>   Add(PyExpr("np.sin(a + b)"), PyExpr("c"))
> 
> The rule it uses to do this is that it uses the Python tokenizer,
> counts nesting of () [] {}, and when it sees a valid unnested patsy
> operator, then that's the end of the embedded expression:
> 
>   https://github.com/pydata/patsy/blob/master/patsy/parse_formula.py#L37
> 
> Not tremendously relevant, but that's why I've thought this through before :-)

For the actual parsing of the expressions, I use the equivalent of
ast.parse(), so I'm not worried about that phase of the processing.
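
In other words, once the expression text has been extracted, the parsing
step is morally equivalent to this (a pure-Python sketch; the real
implementation works at the C level):

    import ast

    expr_text = "np.sin(a + b) + c"         # text extracted from an f-string
    tree = ast.parse(expr_text, mode="eval") # a bad extraction just raises SyntaxError
    print(ast.dump(tree))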

Eric.


