syntax difference
Ben Bacarisse
ben.usenet at bsb.me.uk
Mon Jun 18 12:25:10 EDT 2018
Bart <bc at freeuk.com> writes:
> On 18/06/2018 11:45, Chris Angelico wrote:
>> On Mon, Jun 18, 2018 at 8:33 PM, Bart <bc at freeuk.com> wrote:
>
>
>>> You're right in that neither task is that trivial.
>>>
>>> I can remove comments by writing a tokeniser which scans Python source and
>>> re-outputs tokens one at a time. Such a tokeniser normally ignores comments.
>>>
>>> But to remove type hints, a deeper understanding of the input is needed. I
>>> would need a parser rather than a tokeniser. So it is harder.
>>
>> They would actually both end up the same. To properly recognize
>> comments, you need to understand enough syntax to recognize them. To
>> properly recognize type hints, you need to understand enough syntax to
>> recognize them. And in both cases, you need to NOT discard important
>> information like consecutive whitespace.
>
> No. If syntax is defined on top of tokens, then at the token level,
> you don't need to know any syntax. The process that scans characters
> looking for the next token, will usually discard comments. Job done.
You don't even need to scan for tokens other than strings. From what I
read in the documentation, a simple (f)lex scanner like this would do the trick:
%option noyywrap
%x sqstr dqstr sqtstr dqtstr
%%
\'                  ECHO; BEGIN(sqstr);
\"                  ECHO; BEGIN(dqstr);
\'\'\'              ECHO; BEGIN(sqtstr);
\"\"\"              ECHO; BEGIN(dqtstr);
<dqstr>\"           |
<sqstr>\'           |
<sqtstr>\'\'\'      |
<dqtstr>\"\"\"      ECHO; BEGIN(INITIAL);
<sqstr>\\\'         |
<dqstr>\\\"         |
<sqstr,dqstr,sqtstr,dqtstr,INITIAL>.  ECHO;
#.*                 /* discard the comment */
%%
int main(void) { yylex(); }
and it's only this long because there are four kinds of string. Not
being a Python expert, I may have missed some corner cases. And really
there are comments that should not be removed, such as #! on line 1 and
encoding declarations, but they would just need another line or two.
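
For comparison, here is a rough Python sketch of the same state machine,
extended to keep the two kinds of comment mentioned above. It is only an
illustration, not a drop-in replacement for the scanner: the names
strip_comments, CODING_RE and QUOTES are made up for this example, the
encoding pattern is the one given in PEP 263, and (unlike the lex rules)
any backslash inside a string is assumed to escape the character after it.

```python
import re

# Encoding declaration as described in PEP 263; it only counts on line 1 or 2.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*[-_.a-zA-Z0-9]+")

# Triple quotes are listed first so they win over the single-character forms.
QUOTES = ("'''", '"""', "'", '"')


def strip_comments(source: str) -> str:
    """Return source with # comments removed, leaving string contents alone."""
    out = []
    i, n = 0, len(source)
    closing = None  # None plays the role of INITIAL; otherwise the closing quote
    while i < n:
        ch = source[i]
        if closing is None:
            if ch == "#":
                end = source.find("\n", i)
                if end == -1:
                    end = n
                line_no = source.count("\n", 0, i) + 1
                line_start = source.rfind("\n", 0, i) + 1
                is_shebang = i == 0 and source.startswith("#!")
                is_coding = line_no <= 2 and CODING_RE.match(source[line_start:end])
                if is_shebang or is_coding:
                    out.append(source[i:end])  # keep these special comments
                i = end  # the newline itself is copied on the next iteration
                continue
            for q in QUOTES:
                if source.startswith(q, i):
                    closing = q  # enter the matching string state
                    out.append(q)
                    i += len(q)
                    break
            else:
                out.append(ch)
                i += 1
        elif ch == "\\" and i + 1 < n:
            out.append(source[i:i + 2])  # copy the escape pair untouched
            i += 2
        elif source.startswith(closing, i):
            out.append(closing)  # back to INITIAL
            i += len(closing)
            closing = None
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Calling strip_comments(open(path).read()) on a file would give the
stripped text. Like the scanner, it leaves any whitespace that preceded
a comment in place and makes no attempt to handle malformed input.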
--
Ben.