[Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

Thu May 17 21:51:56 EDT 2018

To answer Larry's question, there's an overwhelming number of different
options -- bytes/unicode, raw/cooked, and (in Py2) `from __future__ import
unicode_literals`. So it's easier to do the actual semantic conversion in a
later stage -- then the lexer only has to worry about hopping over
backslashes.
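
A rough sketch of that split, using only the stdlib. The helper name
string_tokens and the "\x41\n" sample body are made up for illustration, and
ast.literal_eval stands in here for the "later stage"; this is not a
description of how the compiler itself is organized internally.

```python
import ast
import io
import tokenize


def string_tokens(source: bytes) -> list[str]:
    """Return the text of every STRING token, exactly as the tokenizer reports it."""
    readline = io.BytesIO(source).readline
    return [tok.string for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.STRING]


# Hypothetical sample: the characters "\x41\n" with real backslashes in them.
BODY = r'"\x41\n"'

decoded = {}
for prefix in ("", "r", "b", "rb"):
    source = ("x = " + prefix + BODY + "\n").encode("utf-8")
    (text,) = string_tokens(source)            # lexer output: escapes untouched
    decoded[prefix] = ast.literal_eval(text)   # later stage: prefix decides the meaning
    print(repr(text), "->", repr(decoded[prefix]))
```

The token text is the same apart from the prefix, yet the four values differ
in both type (str vs. bytes) and content (cooked vs. raw), which is the part
the lexer does not have to care about.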

On Thu, May 17, 2018 at 3:38 PM, Eric V. Smith <eric at trueblade.com> wrote:

> On 5/17/2018 3:01 PM, Larry Hastings wrote:
>
>>
>>
>> I fed this into tokenize.tokenize():
>>
>>     b''' x = "\u1234" '''
>>
>> I was a bit surprised to see \uxxxx in the output.  Particularly because
>> the output (t.string) was a *string* and not *bytes*.
>>
>
> For those (like me) who have no idea how to use tokenize.tokenize's wacky
> interface, the test code is:
>
> list(tokenize.tokenize(io.BytesIO(b''' x = "\u1234" ''').readline))
>
>> Maybe I'm making a parade of my ignorance, but I assumed that string
>> literals were parsed by the parser--just like everything else is parsed by
>> the parser, hey it seems like a good place for it--and in particular that
>> the escape sequence substitutions would be done in the tokenizer.  Having
>> stared at it a little, I now detect a whiff of "this design solved a real
>> problem".  So... what was the problem, and how does this design solve it?
>>
>
> I assume the intent is to not throw away any information in the lexer, and
> give the parser full access to the original string. But that's just a guess.
>
>> BTW, my use case is that I hoped to use CPython's tokenizer to parse some
>> Python-ish-looking text and handle double-quoted strings for me.
>> *Especially* all the escape sequences--leveraging all CPython's support for
>> funny things like \N{PENGUIN}.  The current behavior of the tokenizer makes
>> me think it'd be easier to roll my own!
>>
>
> Can you feed the token text to the ast?
>
> >>> ast.literal_eval(r'"\u1234"')
> 'ሴ'
>
> Eric
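
For Larry's use case, Eric's suggestion might look roughly like the sketch
below. The function name decoded_strings and the sample text are invented for
illustration. It assumes the input tokenizes as Python and that every string
literal is a plain (non-f) string, since ast.literal_eval only accepts
literals.

```python
import ast
import io
import tokenize


def decoded_strings(text: str):
    """Yield (raw_token_text, decoded_value) for each string literal in text."""
    readline = io.BytesIO(text.encode("utf-8")).readline
    for tok in tokenize.tokenize(readline):
        if tok.type == tokenize.STRING:
            # The tokenizer hands back the literal verbatim; literal_eval
            # applies the escape-sequence rules for whatever prefix it has.
            yield tok.string, ast.literal_eval(tok.string)


# Hypothetical Python-ish input containing several kinds of escapes.
sample = r'greeting = "caf\u00e9 \N{SNOWMAN}\t!"' + "\n"

for raw, value in decoded_strings(sample):
    print(raw, "->", repr(value))
```

That keeps all of CPython's escape handling, including \N{...} names, without
reimplementing any of it.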

-- 
--Guido van Rossum (python.org/~guido)