What to use for finding as many syntax errors as possible.

Chris Angelico rosuav at gmail.com
Mon Oct 10 23:55:12 EDT 2022


On Tue, 11 Oct 2022 at 14:26, <avi.e.gross at gmail.com> wrote:
>
> I stand corrected Chris, and others, as I pay the sin tax.
>
> Yes, there are many kinds of errors that logically fall into different
> categories or phases of evaluation of a program. Some can be determined
> by a more static analysis, almost on a line-by-line (or "statement" or
> "expression", ...) basis; others need to sort of simulate some things
> and look back and forth to detect possible incompatibilities; and yet
> others can only be detected at run time. And there are likely way more
> categories, depending on the language.
>
> But when I run the Python interpreter on code, aren't many such phases
> done interleaved and at once, as various segments of code are parsed and
> examined and perhaps compiled into byte code and eventually executed?

Hmm, depends what you mean. Broadly speaking, here's how it goes:

0) Early pre-parse steps that don't really matter to most programs,
like checking character set. We'll ignore these.
1) Tokenize the text of the program into a sequence of
potentially-meaningful units.
2) Parse those tokens into some sort of meaningful "sentence".
3) Compile the syntax tree into actual code.
4) Run that code.

Example:
>>> code = """def f():
...     print("Hello, world", 1>=2)
...     print(Ellipsis, ...)
...     return True
... """
>>>

In step 1, all that happens is that a stream of characters (or bytes,
depending on your point of view) gets broken up into units.

>>> import tokenize
>>> for t in tokenize.tokenize(iter(code.encode().split(b"\n")).__next__):
...     print(tokenize.tok_name[t.exact_type], t.string)

It's pretty spammy, but you can see how the compiler sees the text.
Note that, at this stage, there's no real difference between the NAME
"def" and the NAME "print" - there are no language keywords yet.
Basically, all you're doing is figuring out punctuation and stuff.
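
The start of that stream looks roughly like this (abbreviated here;
the exact details vary a little between Python versions):

ENCODING utf-8
NAME def
NAME f
LPAR (
RPAR )
COLON :
...and so on through the rest of the function.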

Step 2 is what we'd normally consider "parsing". (It may well happen
concurrently and interleaved with tokenizing, and I'm giving a
simplified and conceptualized pipeline here, but this is broadly what
Python does.) This compares the stream of tokens to the grammar of a
Python program and attempts to figure out what it means. At this
point, the linear stream turns into a recursive syntax tree, but it's
still very abstract.

>>> import ast
>>> ast.dump(ast.parse(code))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[],
args=[], kwonlyargs=[], kw_defaults=[], defaults=[]),
body=[Expr(value=Call(func=Name(id='print', ctx=Load()),
args=[Constant(value='Hello, world'), Compare(left=Constant(value=1),
ops=[GtE()], comparators=[Constant(value=2)])], keywords=[])),
Expr(value=Call(func=Name(id='print', ctx=Load()),
args=[Name(id='Ellipsis', ctx=Load()), Constant(value=Ellipsis)],
keywords=[])), Return(value=Constant(value=True))],
decorator_list=[])], type_ignores=[])"

(Side point: I would rather like to be able to
pprint.pprint(ast.parse(code)) but that isn't a thing, at least not
currently.)
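
(For what it's worth, ast.dump() has accepted an indent argument since
Python 3.9, which gets most of the way there:

>>> print(ast.dump(ast.parse(code), indent=2))

That prints the same tree as above, spread across lines and indented,
which is far easier to read.)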

This is where the vast majority of SyntaxErrors come from. Your code
is a sequence of tokens, but a sequence of tokens doesn't necessarily
mean anything. It doesn't make sense to say "print(def f[return)]",
even though that would tokenize just fine. The trouble with the notion
of "keeping going after finding an error" is that, when you find an
error, there are almost always multiple possible ways the code COULD
have been interpreted differently. Continuing is as likely to give
nonsense results as actually useful ones.
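
To see that split concretely, here's a quick sketch (using a
bracket-balanced variant of that nonsense, since some newer tokenizers
are stricter about mismatched brackets):

>>> import io, tokenize, ast
>>> nonsense = "print(def f[return])\n"
>>> for t in tokenize.generate_tokens(io.StringIO(nonsense).readline):
...     print(tokenize.tok_name[t.exact_type], t.string)

That happily lists NAME, LPAR, NAME, and so on - the tokenizer has no
complaints. Asking for a parse tree is another story:

>>> try:
...     ast.parse(nonsense)
... except SyntaxError as e:
...     print("parse failed:", e.msg)

The tokens are fine; there's just no grammar rule they can satisfy.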

(Note that, in contrast to the tokenization stage, this version
distinguishes between the different types of word. The "def" has
resulted in a FunctionDef node, the "print" is a Name lookup, and both
"..." and "True" have now become Constant nodes - previously, "..."
was a special Ellipsis token, but "True" was just a NAME.)

Step 3: the abstract syntax tree gets compiled into actual runnable
code. This is where that small handful of other SyntaxErrors comes
from. With these errors, you absolutely _could_ carry on and report
multiple; but it's not very likely that there'll actually *be* more
than one of them in a file. Here's some perfectly valid AST parsing:

>>> ast.dump(ast.parse("from __future__ import the_past"))
"Module(body=[ImportFrom(module='__future__',
names=[alias(name='the_past')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("from __future__ import braces"))
"Module(body=[ImportFrom(module='__future__',
names=[alias(name='braces')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("def f():\n\tdef g():\n\t\tnonlocal x\n"))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[],
args=[], kwonlyargs=[], kw_defaults=[], defaults=[]),
body=[FunctionDef(name='g', args=arguments(posonlyargs=[], args=[],
kwonlyargs=[], kw_defaults=[], defaults=[]),
body=[Nonlocal(names=['x'])], decorator_list=[])],
decorator_list=[])], type_ignores=[])"

If you were to try to actually compile those to bytecode, they would fail:

>>> compile(ast.parse("from __future__ import braces"), "-", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "-", line 1
SyntaxError: not a chance
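
The nonlocal example fails the same way, and only at this stage - the
AST was fine, but the compiler can't find any binding for x. The
result should look something like this (exact wording may vary by
version):

>>> compile(ast.parse("def f():\n\tdef g():\n\t\tnonlocal x\n"), "-", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "-", line 3
SyntaxError: no binding for nonlocal 'x' found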

And finally, step 4 is actually running the compiled bytecode. Any
errors that happen at THIS stage are going to be run-time errors, not
syntax errors (a SyntaxError raised at run time would be from
compiling other code).
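
For example, this compiles happily; the error doesn't exist until the
code actually runs:

>>> broken = compile("print(1 / 0)", "-", "exec")   # syntactically fine
>>> exec(broken)
Traceback (most recent call last):
  ...
ZeroDivisionError: division by zero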

> So is the OP asking for something other than a Python Interpreter that
> normally halts after some kind of error? Tools like a linter may indeed fit
> that mold.

Yes, but linters are still going to go through the same process laid
out above. So if you have a huge pile of code that misuses "await" in
non-async functions, sure! Maybe a linter could half-compile the code,
then probe it repeatedly until it gets past everything. That's not
exactly a common case, though. More likely, you'll have parsing
errors, and the only way to "move past" a parsing error is to guess at
what token should be added or removed to make it "kinda work".
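
And you can see that await case directly: the misuse parses to a
perfectly good AST, and it's only the bytecode-compilation step that
objects (exact message wording may vary by version):

>>> tree = ast.parse("def f():\n    await something\n")   # parses fine
>>> compile(tree, "-", "exec")
Traceback (most recent call last):
  ...
SyntaxError: 'await' outside async function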

Alternatively, you'll get some kind of messy heuristics to try to
restart parsing part way down, but that's pretty imperfect too.

> This may limit some of the objections about an error making it hard
> for the parser to find some recovery point to continue from, as no
> code is being run and no harmful side effects happen by continuing a
> mere analysis.

It's pretty straightforward to ensure that no code is run - just
compile it without running it. It's still possible to attack the
compiler itself, but that's far less concerning than running arbitrary
code. Attacks on the compiler are usually deliberate; code you don't
want to run yet might contain a perfectly reasonable call to
os.unlink()...
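
Here's a minimal sketch of that idea - checking a file's syntax without
executing any of it (the function name and error format are just
illustrative):

>>> def syntax_check(path):
...     """Report the first SyntaxError in a file, running none of it."""
...     with open(path, encoding="utf-8") as f:
...         source = f.read()
...     try:
...         compile(source, path, "exec")   # compile only; never executed
...     except SyntaxError as e:
...         print(f"{path}:{e.lineno}: {e.msg}")
...         return False
...     return True

Even if the file is full of os.unlink() calls, nothing runs. This is
more or less what python -m py_compile does (though that also writes
out a .pyc file).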

> Time to go read some books about modern ways to evaluate a language based on
> more mathematical rules including more precisely what is syntax versus ...
>
> Suggestions?
>

I'd recommend looking at Python's compile() function, the ast and
tokenize modules, and everything that they point to.

ChrisA

