Where can I find a lexical spec of Python?

Wed Sep 21 12:55:45 EDT 2011

On 21/09/11 18:33, 程劭非 wrote:
> Thanks Thomas.
> I've read the document http://docs.python.org/py3k/reference/lexical_analysis.html 
> 
> but I worry it might leave out some language features, such as the "tab magic".
> 
> Because I'm working on a parser written in JavaScript, I need a more strictly defined spec.
> 
> Currently I have a highlighter here -> http://shaofei.name/python/PyHighlighter.html
> (Also the lexer: http://shaofei.name/python/PyLexer.html)
> 
> As you can see, I just made its behavior match CPython's, but I'm not sure what the real Python lexical grammar looks like.
> 
> Does anyone know if there is a lexical grammar spec like those for other languages (e.g. http://bclary.com/2004/11/07/#annex-a)?

I believe the language documentation on docs.python.org is all the
documentation of the language there is. It may not be completely formal,
and in parts it concentrates not on the actual rules but on the original
implementation, but, as far as I can tell, it tells you everything you
need to know to write a new parser for the Python language, without any
ambiguity.
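
If you want to check your lexer against the reference implementation, the
standard library's tokenize module reports the same token names that the
grammar uses. A minimal sketch (token_names is just a helper name I made up
for illustration; tokenize.generate_tokens and tokenize.tok_name are the
real standard-library API):

```python
import io
import tokenize


def token_names(source):
    """Return the token type names CPython's tokenize module yields for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


if __name__ == "__main__":
    sample = "if x:\n    y = 'a' + 1\n"
    for name in token_names(sample):
        print(name)
```

Running this on small inputs and comparing the stream with your own lexer's
output should show quickly where the two disagree.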

You appear to be anxious about implementing the indentation mechanism
correctly. The language documentation describes the behaviour precisely.
What is the problem?
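
For what it's worth, here is a rough sketch of the stack-based rule from
section 2.1.8 as I read it. It is not the CPython implementation: the
function names are made up, each input line is treated as one logical line,
and form feeds, line continuations and the TabError consistency check are
all left out.

```python
def indent_width(line, tabsize=8):
    """Column reached by the leading whitespace of line.

    A tab advances to the next multiple of tabsize (8 in the reference).
    """
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col = (col // tabsize + 1) * tabsize
        else:
            break
    return col


def indentation_tokens(lines):
    """Yield-like list of INDENT/DEDENT markers, with LINE for each logical line."""
    stack = [0]  # the stack always starts with a single zero
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank and comment-only lines do not affect indentation
        width = indent_width(line)
        if width > stack[-1]:
            stack.append(width)
            out.append("INDENT")
        else:
            while width < stack[-1]:
                stack.pop()
                out.append("DEDENT")
            if width != stack[-1]:
                raise IndentationError(
                    "unindent does not match any outer indentation level")
        out.append("LINE")
    while stack[-1] > 0:  # at end of file, close every open block
        stack.pop()
        out.append("DEDENT")
    return out
```

The point is that only the comparison with the stack matters, so a lexer in
any language can reproduce it without knowing anything else about the
grammar.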

Thomas

> 
> Please help me. Thanks a lot.
> At 2011-09-21 19:41:33, "Thomas Jollans" <t at jollybox.de> wrote:
>> On 21/09/11 11:44, 程劭非 wrote:
>>> Hi, everyone, 
>>> I've found that several tokens are used in Python's
>>> grammar (http://docs.python.org/reference/grammar.html), but I can't find
>>> their definitions anywhere. The tokens are listed here:
>>
>> They should be documented in
>> http://docs.python.org/py3k/reference/lexical_analysis.html - though
>> apparently not using these exact terms.
>>
>>> NEWLINE
>> Trivial: U+000A
>>
>>> ENDMARKER
>> End of file.
>>
>>> NAME
>> documented as "identifier" in 2.3
>>
>>> INDENT
>>> DEDENT
>> Documented in 2.1.8.
>>
>>> NUMBER
>> Documented in 2.4.3 - 2.4.6
>>
>>> STRING
>> Documented in 2.4.2
>>
>>> I've got some information from the source
>>> code (http://svn.python.org/projects/python/trunk/Parser/tokenizer.c), but
>>> I'm not sure which features are specific to this implementation. (I
>>> saw that the tab stop can be modified with comments using "tab-width:",
>>> ":tabstop=", ":ts=" or "set tabsize="; is this feature really in the spec?)
>>
>> That sounds like a legacy feature that is no longer used. Somebody
>> familiar with the early history of Python might be able to shed more
>> light on the situation. It is inconsistent with the spec (section 2.1.8):
>>
>> """
>> Indentation is rejected as inconsistent if a source file mixes tabs and
>> spaces in a way that makes the meaning dependent on the worth of a tab
>> in spaces; a TabError is raised in that case.
>> """
>>
>> - Thomas
>