syntax difference

Chris Angelico rosuav at gmail.com
Mon Jun 18 07:33:08 EDT 2018


On Mon, Jun 18, 2018 at 9:16 PM, Bart <bc at freeuk.com> wrote:
> On 18/06/2018 11:45, Chris Angelico wrote:
>>
>> On Mon, Jun 18, 2018 at 8:33 PM, Bart <bc at freeuk.com> wrote:
>
>>> You're right in that neither task is that trivial.
>>>
>>> I can remove comments by writing a tokeniser which scans Python source
>>> and
>>> re-outputs tokens one at a time. Such a tokeniser normally ignores
>>> comments.
>>>
>>> But to remove type hints, a deeper understanding of the input is needed.
>>> I
>>> would need a parser rather than a tokeniser. So it is harder.
>>
>> They would actually both end up the same. To properly recognize
>> comments, you need to understand enough syntax to recognize them. To
>> properly recognize type hints, you need to understand enough syntax to
>> recognize them. And in both cases, you need to NOT discard important
>> information like consecutive whitespace.
>
> No. If syntax is defined on top of tokens, then at the token level, you
> don't need to know any syntax. The process that scans characters looking
> for the next token will usually discard comments. Job done.

And it will also usually discard formatting (consecutive whitespace,
etc.). So unless you're okay with reconstructing functionally-equivalent
code, rather than actually preserving the original code, you cannot
merely tokenize. You have to use a form of tokenization that keeps all
of that information.
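
Python's own tokenize module is that form, as it happens: it reports
each token's exact (row, column) coordinates, so you can rebuild the
original spacing around whatever you drop. A rough sketch (untested
here; it will leave trailing blanks where a comment preceded a
newline, and it flattens backslash continuations):

import io
import tokenize

def strip_comments(source):
    # Rebuild the source from token coordinates, skipping COMMENT
    # tokens.  The NL/NEWLINE tokens carry the "\n" characters, so
    # the line structure survives; only the comments disappear.
    out = []
    last_row, last_col = 1, 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            continue
        srow, scol = tok.start
        if srow > last_row:
            last_col = 0                         # new physical line
        if scol > last_col:
            out.append(" " * (scol - last_col))  # original spacing
        out.append(tok.string)
        last_row, last_col = tok.end
    return "".join(out)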

> It is very different for type hints, as you will need to properly parse
> the source code.
>
> As a simpler example, if the task were to eliminate the "+" symbol, that
> would be one kind of token; it would just be skipped when encountered. But
> if the requirement is to eliminate only unary "+" and leave binary "+",
> then that can't be done at the tokeniser level; it will not know the
> context.

Right. You can fairly easily reconstruct code that uses a single
newline for any NEWLINE token, and a single space in any location
where whitespace makes sense. It's not so easy to correctly
reconstruct "x*y + a*b" with the original spacing.

> What will those look like? If copyright/licence comments have their own
> specific syntax, then they just become another token which has to be
> recognised.

If they have specific syntax, they're not comments, are they?

> The main complication I can see is that, if this is really a one-time
> source-to-source translation whose result you will then work with, you
> will usually want to keep the comments.
>
> Then it is a question of more precisely defining the task that such a
> translator is to perform.

Right, exactly. So you need to do an actual smart parse, which - as
mentioned - is functionally equivalent whether you're stripping
comments or some lexical token.
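
For type hints, that parse-then-regenerate approach looks something
like this sketch with the stdlib ast module (ast.unparse needs Python
3.9+; older setups can use the third-party astunparse package), and it
only handles the common cases:

import ast

class StripHints(ast.NodeTransformer):
    # Sketch only: async defs and *args/**kwargs annotations are not
    # handled, and dropping a bare "x: int" that is the sole statement
    # in a block would leave that block empty.
    def visit_FunctionDef(self, node):
        node.returns = None
        for arg in node.args.posonlyargs + node.args.args + node.args.kwonlyargs:
            arg.annotation = None
        self.generic_visit(node)
        return node

    def visit_AnnAssign(self, node):
        if node.value is None:
            return None       # bare "x: int": drop the statement
        # "x: int = 5" becomes "x = 5"
        return ast.Assign(targets=[node.target], value=node.value)

src = "def f(x: int, y: str = 'a') -> bool:\n    z: bool = x > 0\n    return z\n"
tree = StripHints().visit(ast.parse(src))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))   # comments and layout are gone, of course

And what comes out is functionally-equivalent code, not the original
text - which is the whole point above.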

ChrisA


