Why is it impossible to create a compiler than can compile Python to machinecode like C?

Thu Feb 28 23:27:28 EST 2013

On Thu, 28 Feb 2013 12:25:07 -0800, kramer65 wrote:

> Hello,
> 
> I'm using Python for a while now and I love it. There is just one thing
> I cannot understand. There are compilers for languages like C and C++.
> why is it impossible to create a compiler that can compile Python code
> to machinecode?

Your assumption is incorrect. You can compile Python to machine-code, at 
least sometimes. It is quite tricky, for various reasons, but it can be 
done, at various levels of efficiency.

One of the oldest such projects was Psyco, which was a Just-In-Time 
compiler for Python. When Psyco was running, it would detect at run time 
that you were doing calculations on (say) standard ints, compile on the 
fly a machine-code function to perform those calculations, and execute 
it. Psyco has more or less been made obsolete by PyPy, which does the 
same thing only even more so.

http://en.wikipedia.org/wiki/Psyco
http://en.wikipedia.org/wiki/PyPy

> My reasoning is as follows:
> When GCC compiles a program written in C++, it simply takes that code
> and decides what instructions that would mean for the computer's
> hardware. What does the CPU need to do, what does the memory need to
> remember, etc. etc. If you can create this machinecode from C++, then I
> would suspect that it should also be possible to do this (without a
> C-step in between) for programs written in Python.

In principle, yes, but in practice it's quite hard, simply because Python 
does so much more at runtime than C++ (in general).

Take an expression like:

x = a + b

In C++, the compiler knows what kind of data a and b are, what kind of 
data x is supposed to be. They are often low-level machine types like 
int32 or similar, which the CPU can add directly (or at least, the 
compiler can fake it). Even if the variables are high-level objects, the 
compiler can usually make many safe assumptions about what methods will 
be called, and can compile instructions something like this pseudo-code:

10 get the int64 at location 12348   # "a"
20 get the int64 at location 13872   # "b"
30 jump to the function at location 93788  # add two int64s
40 store the result at location 59332  # "x"

which is fast and efficient because most of the hard work is done at 
compile time. But it's also quite restrictive, because you can't change
code on the fly, create new types or functions, etc. (Or, where you can,
then you lose some of the advantages of C++ and end up with something
like Python but with worse syntax.)

In Python, you don't know what a and b are until runtime. They could be 
ints, or lists, or strings, or anything. The + operator could call a 
custom __add__ method, or a __radd__ method, from some arbitrary class. 
Because nearly everything is dynamic, the Python compiler cannot safely 
make many assumptions about the code at compile time. So you end up with 
code like this:

10 search for the name "a" and take note of it
20 search for the name "b" and take note of it
30 decide whether to call a.__add__ or b.__radd__
40 call the appropriate method
60 bind the result to the name "x"

You can get an idea of what Python actually does by disassembling the 
byte code into pseudo-assembly language:

py> code = compile("x = a + b", '', 'single')
py> from dis import dis
py> dis(code)
  1           0 LOAD_NAME                0 (a) 
              3 LOAD_NAME                1 (b) 
              6 BINARY_ADD           
              7 STORE_NAME               2 (x) 
             10 LOAD_CONST               0 (None) 
             13 RETURN_VALUE         

Nevertheless, PyPy can often speed up Python code significantly, 
sometimes to the speed of C or even faster.

http://morepypy.blogspot.com.au/2011/02/pypy-faster-than-c-on-carefully-crafted.html

http://morepypy.blogspot.com.au/2011/08/pypy-is-faster-than-c-again-string.html

-- 
Steven