What to use for finding as many syntax errors as possible.

avi.e.gross at gmail.com
Tue Oct 11 03:10:51 EDT 2022


Thanks for a rather detailed explanation of some of what we have been
discussing, Chris. The overall outline is about what I assumed was there but
some of the details were, to put it politely, fuzzy.

I see resemblances to how a web page is loaded and operated. The two
are very different, of course, but at some level not so much.

I mean a typical web page is read in as HTML, with various keyword
regions expected such as <BODY> ... </BODY> or <DIV ...> ... </DIV>,
things often cleanly nested within other things. The browser makes
nodes galore in some kind of tree format, with an assortment of
objects whose attributes or methods represent aspects of what it
sees. The resulting treelike structure is known as the DOM.
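
To ground the analogy, here is a crude Python sketch using only the
standard library's html.parser (a real browser does vastly more):

from html.parser import HTMLParser

class TreeSketch(HTMLParser):
    """Print HTML nesting as an indented tree - a toy DOM, in effect."""
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)   # one "node" per element
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

TreeSketch().feed("<body><div><ul><li>item</li></ul></div></body>")
# prints body / div / ul / li, each one level deeper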

To a certain approximation, this tree starts a certain way but is
regularly manipulated (or perhaps a copy is) as it is consulted to
decide how to display the page at any given moment, based on the
current tree contents and another set of rules in Cascading Style
Sheets. Bits and pieces of JavaScript are also embedded or imported;
they can read aspects of the tree (and more), modify its contents,
and arrange for all kinds of asynchronous events when bits of code
are invoked, such as when you click a button or hover, when an image
finishes loading, or every 100 milliseconds. JavaScript can insert
new objects into the DOM too. And of course there can be interactions
with restricted local storage as well as with servers and code
running there.

It is quite a mess, but in some ways I see analogies. Your program
reads a stream of data and looks for tokens, eventually turning
things into a tree of sorts that represents relationships, up to a
point. Additional structures appear at run time that store
collections of references to variables, such as environments or
namespaces; the program derived from the trees makes changes as it
goes, and in a language like Python it can even change the running
program in some ways.
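
A tiny illustration of that last point, just as a sketch:

>>> ns = {}
>>> exec(compile("def greet():\n    return 'hi'\n", "<dynamic>", "exec"), ns)
>>> ns["greet"]()
'hi'

The program compiled and injected a brand-new function into a
namespace while running.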

These are not at all the same thing, but they share a certain set of
ideas and methods and can be very powerful as things interact. In the
web case, CSS selectors (or JavaScript, helped along by tools like
jQuery) may search for regions with some class or ID, or that are the
third element of a bullet list, and more, and make changes. A CSS
rule that previously ignored some region as not having a particular
class might start including it after a JavaScript handler, woken by
an event listener for, say, a mouse hovering over an area, changes
that part of the DOM (a node) to be in that class. Suddenly the area
on your screen changes background or whatever else the CSS now
dictates. We have multiple systems written in an assortment of
"languages" that complement each other. Some running programs,
especially ones that use asynchronous methods like threads or
callbacks on events, such as a GUI, can effectively do similar
things, as sketched below.
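
A toy tkinter version of that hover trick might look something like
this (a sketch of the idea, not the real web machinery):

import tkinter as tk

root = tk.Tk()
label = tk.Label(root, text="hover here", bg="white")
label.pack(padx=30, pady=30)

# The bindings play the role of JavaScript event listeners; the
# config() calls play the role of toggling a CSS class.
label.bind("<Enter>", lambda e: label.config(bg="yellow"))
label.bind("<Leave>", lambda e: label.config(bg="white"))

root.mainloop()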

In effect, errors in the web situation have analogies too, as in what
happens if a region of HTML is not well-formed or uses a keyword that
is not recognized. This becomes even more interesting in XML, where
anything can be a keyword and you often need other kinds of files
(often themselves XML, such as a schema) to define what the XML can
be like and what restrictions it may have: can a <BOOK> have multiple
authors but only one optional publication date, and so on (a tiny
hand-rolled check appears below). It can be fascinating and highly
technical. So I am up for the challenge of studying anything from
early compilers for the languages of my youth to more recent
approaches, including ones like what you show.
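
The standard library will parse XML for you but will not validate it
against a schema, so checking that <BOOK> rule by hand might look
like this sketch:

>>> import xml.etree.ElementTree as ET
>>> book = ET.fromstring("<BOOK><AUTHOR>A</AUTHOR><AUTHOR>B</AUTHOR>"
...                      "<PUBDATE>2022</PUBDATE></BOOK>")
>>> len(book.findall("AUTHOR")) >= 1    # multiple authors: allowed
True
>>> len(book.findall("PUBDATE")) <= 1   # at most one publication date
True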

I have time to kill and this might be more fun than other things, for a
while.

There was a guy around a few years ago who suggested he would create
a system where you could write a series of configuration files for
ANY language, and his system would then compile or run programs for
each and every such language. Was that on this forum? Whatever
happened to him?

But although what he promised seemed a bit too much, I can see from
your comments below how, in some ways, a limited amount of that might
be done for some subset of languages that can be parsed and
manipulated as described.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Chris Angelico
Sent: Monday, October 10, 2022 11:55 PM
To: python-list at python.org
Subject: Re: What to use for finding as many syntax errors as possible.

On Tue, 11 Oct 2022 at 14:26, <avi.e.gross at gmail.com> wrote:
>
> I stand corrected Chris, and others, as I pay the sin tax.
>
> Yes, there are many kinds of errors that logically fall into different 
> categories or phases of evaluation of a program and some can be 
> determined by a more static analysis almost on a line by line (or 
> "statement" or "expression", ...)  basis and others need to sort of 
> simulate some things and look back and forth to detect possible 
> incompatibilities and yet others can only be detected at run time and 
> likely way more categories depending on the language.
>
> But when I run the Python interpreter on code, aren't many such phases
> done interleaved and at once as various segments of code are parsed
> and examined and perhaps compiled into byte code and eventually executed?

Hmm, depends what you mean. Broadly speaking, here's how it goes:

0) Early pre-parse steps that don't really matter to most programs, like
checking character set. We'll ignore these.
1) Tokenize the text of the program into a sequence of
potentially-meaningful units.
2) Parse those tokens into some sort of meaningful "sentence".
3) Compile the syntax tree into actual code.
4) Run that code.

Example:
>>> code = """def f():
...     print("Hello, world", 1>=2)
...     print(Ellipsis, ...)
...     return True
... """
>>>

In step 1, all that happens is that a stream of characters (or bytes,
depending on your point of view) gets broken up into units.

>>> import tokenize
>>> for t in tokenize.tokenize(iter(code.encode().split(b"\n")).__next__):
...     print(tokenize.tok_name[t.exact_type], t.string)

It's pretty spammy, but you can see how the compiler sees the text.
Note that, at this stage, there's no real difference between the NAME "def"
and the NAME "print" - there are no language keywords yet.
Basically, all you're doing is figuring out punctuation and stuff.

Step 2 is what we'd normally consider "parsing". (It may well happen
concurrently and interleaved with tokenizing, and I'm giving a simplified
and conceptualized pipeline here, but this is broadly what Python does.)
This compares the stream of tokens to the grammar of a Python program and
attempts to figure out what it means. At this point, the linear stream turns
into a recursive syntax tree, but it's still very abstract.

>>> import ast
>>> ast.dump(ast.parse(code))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[], args=[],
kwonlyargs=[], kw_defaults=[], defaults=[]),
body=[Expr(value=Call(func=Name(id='print', ctx=Load()),
args=[Constant(value='Hello, world'), Compare(left=Constant(value=1),
ops=[GtE()], comparators=[Constant(value=2)])], keywords=[])),
Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Name(id='Ellipsis',
ctx=Load()), Constant(value=Ellipsis)], keywords=[])),
Return(value=Constant(value=True))],
decorator_list=[])], type_ignores=[])"

(Side point: I would rather like to be able to
pprint.pprint(ast.parse(code)) but that isn't a thing, at least not
currently.)
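
(Though on Python 3.9 and later, ast.dump's indent parameter gets
part of the way there - the output is something like:)

>>> print(ast.dump(ast.parse("print(1)"), indent=2))
Module(
  body=[
    Expr(
      value=Call(
        func=Name(id='print', ctx=Load()),
        args=[
          Constant(value=1)],
        keywords=[]))],
  type_ignores=[])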

This is where the vast majority of SyntaxErrors come from. Your code is a
sequence of tokens, but those tokens don't mean anything. It doesn't make
sense to say "print(def f[return)]" even though that'd tokenize just fine.
The trouble with the notion of "keeping going after finding an error" is
that, when you find an error, there are almost always multiple possible ways
that this COULD have been interpreted differently. It's as likely to give
nonsense results as actually useful ones.
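
To see where such an error surfaces, feed that non-sentence back
through ast.parse (the exact message and offset vary by version):

>>> try:
...     ast.parse("print(def f[return)]")
... except SyntaxError as err:
...     print(type(err).__name__, "at line", err.lineno)
...
SyntaxError at line 1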

(Note that, in contrast to the tokenization stage, this version
distinguishes between the different types of word. The "def" has resulted in
a FunctionDef node, the "print" is a Name lookup, and both "..." and "True"
have now become Constant nodes - previously, "..."
was a special Ellipsis token, but "True" was just a NAME.)

Step 3: the abstract syntax tree gets compiled into actual runnable
code. This is where the small handful of other SyntaxErrors comes
from. With these errors, you absolutely _could_ carry on and report
multiple; but it's not very likely that there'll actually *be* more
than one of them in a file. Here's some perfectly valid AST parsing:

>>> ast.dump(ast.parse("from __future__ import the_past"))
"Module(body=[ImportFrom(module='__future__',
names=[alias(name='the_past')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("from __future__ import braces"))
"Module(body=[ImportFrom(module='__future__',
names=[alias(name='braces')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("def f():\n\tdef g():\n\t\tnonlocal x\n"))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[], args=[],
kwonlyargs=[], kw_defaults=[], defaults=[]), body=[FunctionDef(name='g',
args=arguments(posonlyargs=[], args=[], kwonlyargs=[], kw_defaults=[],
defaults=[]), body=[Nonlocal(names=['x'])], decorator_list=[])],
decorator_list=[])], type_ignores=[])"

If you were to try to actually compile those to bytecode, they would fail:

>>> compile(ast.parse("from __future__ import braces"), "-", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "-", line 1
SyntaxError: not a chance
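
The nonlocal example fails at the same stage (the message wording may
vary a little by version):

>>> compile(ast.parse("def f():\n\tdef g():\n\t\tnonlocal x\n"), "-", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "-", line 3
SyntaxError: no binding for nonlocal 'x' found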

And finally, step 4 is actually running the compiled bytecode. Any errors
that happen at THIS stage are going to be run-time errors, not syntax errors
(a SyntaxError raised at run time would be from compiling other code).

> So is the OP asking for something other than a Python Interpreter that 
> normally halts after some kind of error? Tools like a linter may 
> indeed fit that mold.

Yes, but linters are still going to go through the same process laid out
above. So if you have a huge pile of code that misuses "await" in non-async
functions, sure! Maybe a linter could half-compile the code, then probe it
repeatedly until it gets past everything. That's not exactly a common case,
though. More likely, you'll have parsing errors, and the only way to "move
past" a parsing error is to guess at what token should be added or removed
to make it "kinda work".

Alternatively, you'll get some kind of messy heuristics to try to restart
parsing part way down, but that's pretty imperfect too.
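
A deliberately crude sketch of such a heuristic - blank out the
offending line and re-parse - shows both the idea and why it is
guesswork:

import ast

def first_errors(source, max_errors=5):
    """Collect up to max_errors SyntaxErrors by blanking each
    offending line and re-parsing. Pure guesswork, as noted above."""
    lines = source.splitlines()
    errors = []
    for _ in range(max_errors):
        try:
            ast.parse("\n".join(lines))
            break
        except SyntaxError as err:
            if err.lineno is None or err.lineno > len(lines):
                break
            errors.append((err.lineno, err.msg))
            lines[err.lineno - 1] = ""   # the "guess": drop the line
    return errors

Blanking a line can of course manufacture new errors (dedenting a
block, for instance), which is exactly the imperfection described.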

> This may limit some of the objections of when an error makes it hard
> for the parser to find some recovery point to continue from as no
> code is being run and no harmful side effects happen by continuing
> just an analysis.

It's pretty straightforward to ensure that no code is run - just compile it
without running it. It's still possible to attack the compiler itself, but
far less concerning than running arbitrary code.
Attacks on the compiler are usually deliberate; code you don't want to run
yet might be a perfectly reasonable call to os.unlink()...
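
In other words, something like this compiles the suspicious code
without ever executing it:

>>> risky = "import os\nos.unlink('some-precious-file')\n"
>>> code_obj = compile(risky, "<check>", "exec")   # parse + compile only
>>> # Nothing was deleted; code_obj sits inert until exec(code_obj).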

> Time to go read some books about modern ways to evaluate a language
> based on more mathematical rules including more precisely what is
> syntax versus ...
>
> Suggestions?
>
>

I'd recommend looking at Python's compile() function, the ast and
tokenize modules, and everything that they point to.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list


