PEP 330 – Python Bytecode Verification

Author:: Michel Pelletier <michel at users.sourceforge.net>
Status:: Rejected
Type:: Standards Track
Created:: 17-Jun-2004
Python-Version:: 2.6
Post-History:

Table of Contents

Abstract
Pronouncement
Motivation
Static Constraints on Bytecode Instructions
Static Constraints on Bytecode Instruction Operands
Structural Constraints between Bytecode Instructions
Implementation
Verification Issues
Required Changes
References
Copyright

Abstract

If Python Virtual Machine (PVM) bytecode is not “well-formed” it is possible to crash or exploit the PVM by causing various errors such as under/overflowing the value stack or reading/writing into arbitrary areas of the PVM program space. Most of these kinds of errors can be eliminated by verifying that PVM bytecode does not violate a set of simple constraints before execution.

This PEP proposes a set of constraints on the format and structure of Python Virtual Machine (PVM) bytecode and provides an implementation in Python of this verification process.

Pronouncement

Guido believes that a verification tool has some value. If someone wants to add it to Tools/scripts, no PEP is required.

Such a tool may have value for validating the output from “bytecodehacks” or from direct edits of PYC files. As security measure, its value is somewhat limited because perfectly valid bytecode can still do horrible things. That situation could change if the concept of restricted execution were to be successfully resurrected.

Motivation

The Python Virtual Machine executes Python programs that have been compiled from the Python language into a bytecode representation. The PVM assumes that any bytecode being executed is “well-formed” with regard to a number implicit constraints. Some of these constraints are checked at run-time, but most of them are not due to the overhead they would create.

When running in debug mode the PVM does do several run-time checks to ensure that any particular bytecode cannot violate these constraints that, to a degree, prevent bytecode from crashing or exploiting the interpreter. These checks add a measurable overhead to the interpreter, and are typically turned off in common use.

Bytecode that is not well-formed and executed by a PVM not running in debug mode may create a variety of fatal and non-fatal errors. Typically, ill-formed code will cause the PVM to seg-fault and cause the OS to immediately and abruptly terminate the interpreter.

Conceivably, ill-formed bytecode could exploit the interpreter and allow Python bytecode to execute arbitrary C-level machine instructions or to modify private, internal data structures in the interpreter. If used cleverly this could subvert any form of security policy an application may want to apply to its objects.

Practically, it would be difficult for a malicious user to “inject” invalid bytecode into a PVM for the purposes of exploitation, but not impossible. Buffer overflow and memory overwrite attacks are commonly understood, particularly when the exploit payload is transmitted unencrypted over a network or when a file or network security permission weakness is used as a foothold for further attacks.

Ideally, no bytecode should ever be allowed to read or write underlying C-level data structures to subvert the operation of the PVM, whether the bytecode was maliciously crafted or not. A simple pre-execution verification step could ensure that bytecode cannot over/underflow the value stack or access other sensitive areas of PVM program space at run-time.

This PEP proposes several validation steps that should be taken on Python bytecode before it is executed by the PVM so that it compiles with static and structure constraints on its instructions and their operands. These steps are simple and catch a large class of invalid bytecode that can cause crashes. There is also some possibility that some run-time checks can be eliminated up front by a verification pass.

There is, of course, no way to verify that bytecode is “completely safe”, for every definition of complete and safe. Even with bytecode verification, Python programs can and most likely in the future will seg-fault for a variety of reasons and continue to cause many different classes of run-time errors, fatal or not. The verification step proposed here simply plugs an easy hole that can cause a large class of fatal and subtle errors at the bytecode level.

Currently, the Java Virtual Machine (JVM) verifies Java bytecode in a way very similar to what is proposed here. The JVM Specification version 2 [1], Sections 4.8 and 4.9 were therefore used as a basis for some of the constraints explained below. Any Python bytecode verification implementation at a minimum must enforce these constraints, but may not be limited to them.

Static Constraints on Bytecode Instructions

The bytecode string must not be empty. (len(co_code) > 0).
The bytecode string cannot exceed a maximum size (len(co_code) < sizeof(unsigned char) - 1).
The first instruction in the bytecode string begins at index 0.
Only valid byte-codes with the correct number of operands can be in the bytecode string.

Static Constraints on Bytecode Instruction Operands

The target of a jump instruction must be within the code boundaries and must fall on an instruction, never between an instruction and its operands.
The operand of a LOAD_* instruction must be a valid index into its corresponding data structure.
The operand of a STORE_* instruction must be a valid index into its corresponding data structure.

Structural Constraints between Bytecode Instructions

Each instruction must only be executed with the appropriate number of arguments in the value stack, regardless of the execution path that leads to its invocation.
If an instruction can be executed along several different execution paths, the value stack must have the same depth prior to the execution of the instruction, regardless of the path taken.
At no point during execution can the value stack grow to a depth greater than that implied by co_stacksize.
Execution never falls off the bottom of co_code.

Implementation

This PEP is the working document for a Python bytecode verification implementation written in Python. This implementation is not used implicitly by the PVM before executing any bytecode, but is to be used explicitly by users concerned about possibly invalid bytecode with the following snippet:

import verify
verify.verify(object)

The verify module provides a verify function which accepts the same kind of arguments as dis.dis: classes, methods, functions, or code objects. It verifies that the object’s bytecode is well-formed according to the specifications of this PEP.

If the code is well-formed the call to verify returns silently without error. If an error is encountered, it throws a VerificationError whose argument indicates the cause of the failure. It is up to the programmer whether or not to handle the error in some way or execute the invalid code regardless.

Phillip Eby has proposed a pseudo-code algorithm for bytecode stack depth verification used by the reference implementation.

Verification Issues

This PEP describes only a small number of verifications. While discussion and analysis will lead to many more, it is highly possible that future verification may need to be done or custom, project-specific verifications. For this reason, it might be desirable to add a verification registration interface to the test implementation to register future verifiers. The need for this is minimal since custom verifiers can subclass and extend the current implementation for added behavior.

Required Changes

Armin Rigo noted that several byte-codes will need modification in order for their stack effect to be statically analyzed. These are END_FINALLY, POP_BLOCK, and MAKE_CLOSURE. Armin and Guido have already agreed on how to correct the instructions. Currently the Python implementation punts on these instructions.

This PEP does not propose to add the verification step to the interpreter, but only to provide the Python implementation in the standard library for optional use. Whether or not this verification procedure is translated into C, included with the PVM or enforced in any way is left for future discussion.

References

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0330.rst

Last modified: 2023-09-09 17:39:29 GMT