Python and the need for speed

Steven D'Aprano steve at pearwood.info
Tue Apr 11 04:19:31 EDT 2017


On Sun, 09 Apr 2017 19:05:35 +1000, Chris Angelico wrote:


> When people talk about making a restricted optimizable subset of Python,
> the implication (if not the explicit declaration) is that it's done
> strictly by removing, not by rewriting.


Maybe *some* people have a naive, if not foolish, idea that all you need 
to do to speed up CPython is to go through the C code and hit delete over 
any functions you don't want. But that's hardly a technique that is going 
to work: not only will you break features, but merely removing code 
doesn't magically make the rest of the code base faster.

At the very least, you have to re-write the code that depends on the 
features you have just removed.

If we're going to talk about speeding up Python, we ought to talk about 
*serious* approaches to the problem, not the musings of random, ignorant 
bloggers and other naive commentators. So let's look at what the experts 
have done:

PyPy: a complete re-write, with a completely different approach (JIT 
compilation).

WPython (alas, now more or less abandoned): a complete re-write of the 
virtual machine, using 16-bit "wordcodes" instead of 8-bit "bytecodes"; 
this enabled a lot of good optimizations:

http://repository.root-me.org/Programmation/Python/EN%20-%20Beyond%20python%20bytecode.pdf

https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/wpython2/Cleanup%20and%20new%20optimizations%20in%20WPython%201.1.pdf


Unladen Swallow: among other changes, added new, specialist byte-codes to 
speed up common operations.

http://qinsb.blogspot.com.au/2011/03/unladen-swallow-retrospective.html


Nuitka: a static optimizing compiler, and a complete re-write. Nuitka 
implements a number of compile-time optimizations while still keeping 
(claimed) 100% compatibility with CPython.

http://nuitka.net/doc/user-manual.html

FATPython adds compile-time name lookups, specialised functions, dead-
code elimination, constant folding and propagation to a CPython base. 

https://haypo.github.io/fat-python-status-janv12-2016.html


Cython: a *superset*, not subset, of Python.

I could continue, but you get the point. Some of these are experimental, 
like FATPython; some were production ready but abandoned because the 
community had little interest (wpython); some are actively maintained and 
used in production by many people (cython, PyPy).

The Python ecosystem is actually quite healthy: if you need to speed up 
code, there are lots of solutions, and some of them are even good 
solutions. Nevertheless, it is interesting to discuss whether or not any 
of these features will go mainstream or make it into CPython.

But anyway... it's not sufficient to just delete features to speed up a 
language. You have to change implementation, and very likely add entirely 
new internal APIs for the compiler.

Creating a restricted subset of Python, and giving up compatibility with 
ordinary Python, is certainly an option. That's what Facebook did with 
their version of PHP. But it's not the only approach.



> A couple more quotes from the article:
> 
>> It should be possible to define a subset of the Python language,
>> uninspiredly dubbed “TurboPython”, that excludes those features that
>> stand in the way of high-performance JIT execution (or compilation).
>> Not using some of these features coincides with good design practices,
>> so it doesn’t necessarily have to be all bad.
> ...
>> Since TurboPython is a subset of Python, it will also run on Python
>> interpreters, albeit slower.
> 
> Both of these imply that the standard library of TurboPython is
> *exactly* the standard library of CPython, 

No. It doesn't say anything about the standard library. Maybe TurboPython 
has no standard library at all. The library is independent of the 
interpreter, and until somebody actually puts in the effort of writing it 
(or copying it from one repo to another), we are equally justified in 
saying any of:

- TurboPython's std lib will have 100% functional compatibility 
  with Python, only the implementation may change;

- TurboPython has no std lib, it only has the built-in types and modules,
  if you want more, you can port it yourself;

- TurboPython's std lib is a subset of the Python std lib, just as
  TurboPython itself is a subset of Python;

- TurboPython's std lib is a MILLION TIMES BETTER than Python's.


Since TurboPython doesn't exist, all of those statements are equally true.

But in a *practical* sense, for our mythical/hypothetical TurboPython to 
have a chance at going mainstream, it needs a record/struct type, which 
in Python is namedtuple. That does not necessarily mean it needs to 
exec() arbitrary code. It just needs a way to create classes dynamically.

Even Java can do that.

http://stackoverflow.com/questions/2320404/
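
In Python you don't even need exec() for that: the three-argument form 
of type() builds a class at runtime. Here's a minimal sketch of a 
namedtuple-style factory done that way (simple_record is my name for 
it, purely illustrative, not a stdlib API):

    from operator import itemgetter

    def simple_record(typename, field_names):
        fields = tuple(field_names.split())

        def __new__(cls, *args):
            if len(args) != len(fields):
                raise TypeError("expected %d arguments" % len(fields))
            return tuple.__new__(cls, args)

        namespace = {'__new__': __new__, '__slots__': ()}
        # Each field becomes a read-only property indexing the tuple.
        for index, name in enumerate(fields):
            namespace[name] = property(itemgetter(index))

        # The dynamic part: build the class without exec().
        return type(typename, (tuple,), namespace)

    Point = simple_record('Point', 'x y')
    p = Point(3, 4)
    assert p.x == 3 and p[1] == 4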



> minus the bits that aren't
> supported. We just cut a few of those nasty dynamic features out of the
> language, and voila! it becomes faster.

That's exactly what I'm talking about. Anyone who says "voila! it becomes 
faster" and means it is too foolish to live :-)

Cutting out the dynamic features is just the first step in allowing you 
to re-write the interpreter with more optimizations. A non-optimizing 
compiler doesn't magically become optimizing just because you've deleted 
some code from the interpreter.


> (Or in this case,
> JITable.) There's nothing suggested here about reimplementing existing
> features in a different way, 

Except for common bloody sense :-P


[...]
> But again, you certainly COULD reimplement literal_eval, but then you
> have to keep your implementation accurate and up-to-date, else you risk
> bugs creeping in. It's a non-trivial task to rewrite these kinds of
> things and maintain parallel versions; 

Sure, it's not thirty seconds' work, but it's not *hard* to write a parser 
that accepts a small subset of Python expressions (those involving only 
literals and a handful of built-in displays).
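
For instance, here's a minimal sketch of such an evaluator, leaning on 
the ast module for the parsing (as literal_eval itself does). 
eval_literal is my name for it, and it handles far less than the real 
thing:

    import ast

    def eval_literal(source):
        # Accept only constants and the basic displays; reject
        # everything else (names, calls, operators, ...).
        def convert(node):
            if isinstance(node, ast.Constant):
                return node.value
            if isinstance(node, ast.Tuple):
                return tuple(convert(elt) for elt in node.elts)
            if isinstance(node, ast.List):
                return [convert(elt) for elt in node.elts]
            if isinstance(node, ast.Set):
                return {convert(elt) for elt in node.elts}
            if isinstance(node, ast.Dict):
                return {convert(k): convert(v)
                        for k, v in zip(node.keys, node.values)}
            raise ValueError('unsupported node: %s'
                             % type(node).__name__)
        return convert(ast.parse(source, mode='eval').body)

    assert eval_literal("{1, 2}") == {1, 2}
    assert eval_literal("[(1, 'a'), {'b': 2}]") == [(1, 'a'), {'b': 2}]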

Are you aware that literal_eval is *not* the same code that the Python 
interpreter uses for evaluating literals? They are kept in sync manually, 
or not kept in sync as the case may be.

In Python 2.7, literal_eval didn't support the set display syntax even 
though the interpreter did:

>>> from ast import literal_eval
>>> {1, 2}
set([1, 2])

>>> literal_eval("{1, 2}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/ast.py", line 80, in literal_eval
    return _convert(node_or_string)
  File "/usr/lib/python2.7/ast.py", line 79, in _convert
    raise ValueError('malformed string')
ValueError: malformed string



literal_eval also had a bug that allowed it to evaluate the + and - 
operators, which it needs to accept in order to support complex numbers:

http://stackoverflow.com/questions/40584417/why-does-ast-literal-eval5-7

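To see why those operators are unavoidable, note that a complex number 
with a non-zero real part isn't a single constant at the AST level:

    import ast

    # "1+2j" parses as an addition of two constants, not one constant,
    # so literal_eval must special-case + and - on numbers.
    print(ast.dump(ast.parse("1+2j", mode="eval").body))
    # BinOp(left=Constant(value=1), op=Add(), right=Constant(value=2j))
    print(ast.literal_eval("1+2j"))  # (1+2j)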


The point being, while it's not *ideal* that support for evaluating 
Python literals needs to be written twice (once in the real interpreter, 
once in literal_eval), it's not unusual or a particularly onerous 
requirement.


> by calling on compile(), the
> current implementation *for free* is kept up-to-date with changes in
> CPython's grammar. 


That's incorrect. literal_eval() needs a whitelist of what's allowed, and 
a blacklist of what is allowed by the whitelist but shouldn't be. There 
may be other work needed as well. See the source code :-)

[...]
> Yes. As long as it can know when the more dynamic features are being
> used - and that's the hard part. 

Hard, but not impossible, and not that hard. Look at Nuitka and FATPython.

The thing that people forget is that not all errors are equal. There are 
two ways that the optimizing compiler can do the wrong thing:


- fail to apply an optimization when it is safe to do so;

- apply an optimization when it is unsafe to do so.


Those errors are not equivalent. In the first case, all that happens is 
that you lose an opportunity to be a little bit faster, but the executed 
code is still correct. (Or at least no less correct than what you wrote.) 
If you even notice, which you might not, it might be tricky to work out 
why the compiler isn't applying the optimization, but in principle it can 
be done and then worked around, or the compiler made a bit smarter.

For example, it might be that the compiler applies an absolutely dead-
simple rule: *any* use of eval or exec disables *all* optimizations. 
Don't like it? Then don't use eval or exec. If that's not an option for 
you, patches are welcome.
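
A sketch of what that dead-simple rule might look like, catching only 
direct calls to the two names (a real compiler would also have to worry 
about aliasing, getattr tricks, and so on; the function name is mine):

    import ast

    def uses_eval_or_exec(source):
        """Conservatively report any direct call to eval or exec."""
        for node in ast.walk(ast.parse(source)):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in ('eval', 'exec')):
                return True
        return False

    assert uses_eval_or_exec("x = eval('1 + 1')")
    assert not uses_eval_or_exec("x = 1 + 1")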

The second case is much worse. If you're lucky, the "optimized" code will 
merely crash, and you'll know it is broken. But worse, it may be subtly 
incorrect, and do the wrong thing. Calculate a *very slightly off* 
result. Or transfer funds to the wrong account, delete the wrong files, 
turn off the wrong life-support machine, that sort of thing.

The second case must be avoided at all costs. (Or at least, unless the 
programmer consents to unsafe optimizations that may change behaviour, 
such as many unsafe floating point optimizations.)

The first case merely indicates room for improvement. Patches are welcome.


> Remember, as soon as you have a single
> class implemented in Python, it could have a method injected into it
> without your knowledge. Can you detect that statically, or can you use
> the versioning of __dict__ to notice that something's been broken? What
> makes you fall back to the default implementation?

Ask Victor Stinner.

You ask these rhetorical questions as if they're major impediments, but 
they actually are more-or-less solved problems. The tricky part is just 
applying the solutions to *Python* itself. It's one thing to have a 
general strategy for optimizations, it's another to apply it to a real 
language, and especially a real implementation.

But effectively, yes, I believe that dict versioning can be used to do 
this. Suppose you have code that says:

result = myinstance.method(99)


So you compile byte-code something like this pseudo-code:

    lookup myinstance
    if type(myinstance) is the expected type and neither the type 
             nor instance __dict__ have changed:
        push 99 on the stack
        JUMP TO CODE AT 123456  # original location of method's code
        return

    else:
        get myinstance
        lookup method in myinstance.__dict__
        if not found:
            lookup method in each of type(myinstance).__mro__ or fail
        wrap the found object in MethodType
        build a set of positional and keyword arguments
        pass those positional and keyword arguments to method object
        which unpacks them and assigns to parameters
        location = look up the method's code object
        set up the book-keeping needed
        JUMP TO CODE AT location  # probably 123456 as above
        tidy up book-keeping
        return


The first case avoids a HUGE amount of work, at the cost of checking a 
couple of guards ahead of time. So the fast path becomes (say) 20% 
faster, and the slow path becomes 2% slower. Even if you take the slow 
path nine times as often as the fast path, it's still an overall win: 
nine slow calls at 102% of the original cost plus one fast call at 80% 
averages out to 99.8%.
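
Here's a runnable toy version of that guard-and-cache idea in pure 
Python. Real implementations (e.g. PEP 509 dict versioning in CPython) 
do this at the C level; the version counter here is just a stand-in:

    class VersionedNamespace:
        """Dict-like wrapper that bumps a version on every mutation."""
        def __init__(self):
            self.version = 0
            self._data = {}

        def __setitem__(self, key, value):
            self.version += 1
            self._data[key] = value

        def __getitem__(self, key):
            return self._data[key]

    class CallSite:
        """Caches one method lookup, guarded by the namespace version."""
        def __init__(self):
            self.cached_version = None
            self.cached_method = None

        def call(self, ns, name, *args):
            if ns.version == self.cached_version:
                # Fast path: the guard passed, skip the lookup.
                return self.cached_method(*args)
            # Slow path: full lookup, then refill the cache.
            method = ns[name]
            self.cached_version = ns.version
            self.cached_method = method
            return method(*args)

    ns = VersionedNamespace()
    ns['method'] = lambda x: x + 1
    site = CallSite()
    assert site.call(ns, 'method', 99) == 100  # slow path, fills cache
    assert site.call(ns, 'method', 99) == 100  # fast path, guard holds
    ns['method'] = lambda x: x * 2             # mutation bumps version
    assert site.call(ns, 'method', 99) == 198  # guard fails, re-lookup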

But you don't get that just by selecting the C code for eval() and exec() 
in your editor and pressing Delete.


-- 
Steve


