Notice: While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience.

PEP 675 -- Arbitrary Literal Strings

PEP:675
Title:Arbitrary Literal Strings
Author:Pradeep Kumar Srinivasan <gohanpra at gmail.com>, Graham Bleaney <gbleaney at gmail.com>
Sponsor:Jelle Zijlstra <jelle.zijlstra at gmail.com>
Discussions-To:Typing-Sig <typing-sig at python.org>
Status:Draft
Type:Standards Track
Created:30-Nov-2021
Python-Version:3.11
Post-History:

Abstract

There is currently no way to specify that a function parameter can be of any literal string type; we have to specify the precise literal string, such as Literal["foo"]. This PEP introduces a supertype of literal string types: Literal[str]. This allows a function to accept arbitrary literal string types such as Literal["foo"] or Literal["bar"].

Motivation

Powerful APIs that execute SQL or shell commands often recommend that they be invoked with literal strings, rather than arbitrary user controlled strings. There is no way to express this recommendation in the type system, however, meaning security vulnerabilities sometimes occur when developers fail to follow it. For example, a naive way to look up a user record from a database is to accept a user id and insert it into a predefined SQL query:

def query_user(conn: Connection, user_id: str) -> User:
    query = f"SELECT * FROM data WHERE user_id = {user_id}"
    conn.execute(query)

query_user(conn, "user123")  # OK.

However, the user-controlled data user_id is being mixed with the SQL command string, which means a malicious user could run arbitrary SQL commands:

# Delete the table.
query_user(conn, "user123; DROP TABLE data;")

# Fetch all users (since 1 = 1 is always true).
query_user(conn, "user123 OR 1 = 1")

To prevent such SQL injection attacks, SQL APIs offer parameterized queries, which separate the executed query from user-controlled data and make it impossible to run arbitrary queries. For example, with sqlite3, our original function would be written safely as a query with parameters:

def query_user(conn: Connection, user_id: str) -> User:
    query = "SELECT * FROM data WHERE user_id = ?"
    conn.execute(query, (user_id,))

The problem is that there is no way to enforce this discipline. sqlite3's own documentation can only admonish the reader to not dynamically build the sql argument from external input; the API's authors cannot express that through the type system. Users can (and often do) still use a convenient f-string as before and leave their code vulnerable to SQL injection.

Existing tools, such as the popular security linter Bandit, attempt to detect unsafe external data used in SQL APIs, by inspecting the AST or by other semantic pattern-matching. These tools, however, preclude common idioms like storing a large multi-line query in a variable before executing it, adding literal string modifiers to the query based on some conditions, or transforming the query string using a function. (We survey existing tools in the "Rejected Alternatives" section.) For example, many tools will detect a false positive issue in this benign snippet:

def query_data(conn: Connection, user_id: str, limit: bool) -> None:
    query = """
        SELECT
            user.name,
            user.age
        FROM data
        WHERE user_id = ?
    """
    if limit:
        query += " LIMIT 1"

    conn.execute(query, (user_id,))

We want to forbid harmful execution of user-controlled data while still allowing benign idioms like the above and not requiring extra user work.

To meet this goal, we introduce the Literal[str] type, which only accepts string values that are known to be made of literals. This is a generalization of the Literal["foo"] type from PEP 586. A string of type Literal[str] cannot contain user-controlled data. Thus, any API that only accepts Literal[str] will be immune to injection vulnerabilities (with pragmatic limitations).

Since we want the sqlite3 execute method to disallow strings built with user input, we would make its typeshed stub accept a sql query that is of type Literal[str]:

def execute(self, sql: Literal[str], parameters: Iterable[str] = ...) -> Cursor: ...

This successfully forbids our unsafe SQL example. The variable query below is inferred to have type str, since it is created from a format string using user_id, and cannot be passed to execute:

def query_user(conn: Connection, user_id: str) -> User:
    query = f"SELECT * FROM data WHERE user_id = {user_id}"
    conn.execute(query)
    # Error: Expected Literal[str], got str.

The method remains flexible enough to allow our more complicated example:

def query_data(conn: Connection, user_id: str, limit: bool) -> None:
    # This is a literal string.
    query = """
        SELECT
            user.name,
            user.age
        FROM data
        WHERE user_id = ?
    """

    if limit:
        # Still has type Literal[str] because we added a literal string.
        query += " LIMIT 1"

    conn.execute(query, (user_id,))  # OK

Notice that the user did not have to change their SQL code at all. The type checker was able to infer the literal string type and complain only in case of violations.

Literal[str] is also useful in other cases where we want strict command-data separation, such as when building shell commands or when rendering a string into an HTML response without escaping (see Appendix A: Other Uses). Overall, this combination of strictness and flexibility makes it easy to enforce safer API usage in sensitive code without burdening users.

Usage statistics

In a sample of open-source projects using sqlite3, we found that conn.execute was called ~67% of the time with a safe string literal and ~33% of the time with a potentially unsafe, local string variable. Using this PEP's literal string type along with a type checker would prevent the unsafe portion of that 33% of cases (ie. the ones where user controlled data is incorporated into the query), while seamlessly allowing the safe ones to remain.

Rationale

Firstly, why use types to prevent security vulnerabilities?

Warning users in documentation is insufficient - most users either never see these warnings or ignore them. Using an existing dynamic or static analysis approach is too restrictive - these prevent natural idioms, as we saw in the Motivation section (and will discuss more extensively in the Rejected Alternatives section). The typing-based approach in this PEP strikes a user-friendly balance between strictness and flexibility.

Runtime approaches do not work because, at runtime, the query string is a plain str. While we could prevent some exploits using heuristics, such as regex-filtering for obviously malicious payloads, there will always be a way to work around them (perfectly distinguishing good and bad queries reduces to the halting problem).

Static approaches like checking the AST to see if the query string is a literal string expression cannot tell when a string is assigned to an intermediate variable or when it is transformed by a benign function. This makes them overly restrictive.

The type checker, surprisingly, does better than both because it has access to information not available in the runtime or static analysis approaches. Specifically, the type checker can tell us whether an expression has a literal string type, say Literal["foo"]. The type checker already propagates types across variable assignments or function calls.

In the current type system itself, if the SQL or shell command execution function only accepted three possible input strings, our job would be done. We would just say:

def execute(query: Literal["foo", "bar", "baz"]) -> None: ...

But, of course, execute can accept any possible query. How do we ensure that the query does not contain an arbitrary, user-controlled string?

We want to specify that the value must be of some type Literal[<...>] where <...> is some string. This is what Literal[str] represents. Literal[str] is the "supertype" of all literal string types. In effect, this PEP just introduces a type in the type hierarchy between Literal["foo"] and str. Any particular literal string such as Literal["foo"] or Literal["bar"] is compatible with Literal[str], but not the other way around. The "supertype" of Literal[str] itself is str. So, Literal[str] is compatible with str, but not the other way around.

Note that a Union of literal types is naturally compatible with Literal[str] because each element of the Union is individually compatible with Literal[str]. So, Literal["foo", "bar"] is compatible with Literal[str].

However, recall that we don't just want to represent exact literal queries. We also want to support composition of two literal strings, such as query + " LIMIT 1". This too is possible with the above concept. If x and y are two values of type Literal[str], then x + y will also be of type compatible with Literal[str]. We can reason about this by looking at specific instances such as Literal["foo"] and Literal["bar"]; the value of the added string x + y can only be "foobar", which has type Literal["foobar"] and is thus compatible with Literal[str]. The same reasoning applies when x and y are unions of literal types; the result of pairwise adding any two literal types from x and y respectively is a literal type, which means that the overall result is a Union of literal types and is thus compatible with Literal[str].

In this way, we are able to leverage Python's concept of a Literal string type to specify that our API can only accept strings that are known to be constructed from literals. More specific details follow in the remaining sections.

Valid Locations for Literal[str]

Literal[str] can be used where any other type can be used:

variable_annotation: Literal[str]

def my_function(literal_string: Literal[str]) -> Literal[str]: ...

class Foo:
    my_attribute: Literal[str]

type_argument: List[Literal[str]]

T = TypeVar("T", bound=Literal[str])

It can be nested within unions of Literal types:

union: Literal["hello", Literal[str]]
union2: Literal["hello", str]
union3: Literal[str, 4]

nested_literal_string: Literal[Literal[str]]

The restrictions on the parameters of Literal are the same as in PEP 586. The only legal parameter is the literal value str. Other values are rejected even if they evaluate to the same value (str), such as Literal[(lambda x: x)(str)].

Type Inference

Inferring Literal[str]

Any literal string type is compatible with Literal[str]. For example, x: Literal[str] = "foo" is valid because "foo" is inferred to be of type Literal["foo"].

As per the Rationale, we also infer Literal[str] in the following cases:

  • Addition: x + y is of type Literal[str] if both x and y are compatible with Literal[str].
  • Joining: sep.join(xs) is of type Literal[str] if sep's type is compatible with Literal[str] and xs's type is compatible with Iterable[Literal[str]].
  • In-place addition: If s has type Literal[str] and x has type compatible with Literal[str], then s += x preserves s's type as Literal[str].
  • String formatting: An f-string has type Literal[str] if and only if its constituent expressions are literal strings. s.format(...) has type Literal[str] if and only if s and the arguments have types compatible with Literal[str].

In all other cases, if one or more of the composed values has a non-literal type str, the composition of types will have type str. For example, if s has type str, then "hello" + s has type str. This matches the pre-existing behavior of type checkers.

Literal[str] is compatible with the type str. It inherits all methods from str. So, if we have a variable s of type Literal[str], it is safe to write s.startswith("hello").

Note that, beyond the few composition rules mentioned above, this PEP doesn't change inference for other str methods such as literal_string.upper().

Some type checkers refine the type of a string when doing an equality check:

def foo(s: str) -> None:
    if s == "bar":
        reveal_type(s)  # => Literal["bar"]

Such a refined type in the if-block is also compatible with Literal[str] because its type is Literal["bar"].

Examples

See the examples below to help clarify the above rules:

literal_string: Literal[str]
s: str = literal_string  # OK

literal_string: Literal[str] = s  # Error: Expected Literal[str], got str.
literal_string: Literal[str] = "hello" # OK


def expect_literal_str(s: Literal[str]) -> None: ...

Addition of literal strings:

expect_literal_str("foo" + "bar")  # OK
expect_literal_str(literal_string + "bar")  # OK
literal_string2: Literal[str]
expect_literal_str(literal_string + literal_string2)  # OK
plain_str: str
expect_literal_str(literal_string + plain_str)  # Not OK.

Join using literal strings:

expect_literal_str(",".join(["foo", "bar"]))  # OK
expect_literal_str(literal_string.join(["foo", "bar"]))  # OK
expect_literal_str(literal_string.join([literal_string, literal_string2]))  # OK
xs: List[Literal[str]]
expect_literal_str(literal_string.join(xs)) # OK
expect_literal_str(plain_str.join([literal_string, literal_string2]))
# Not OK because the separator has type ``str``.

In-place addition using literal strings:

literal_string += "foo"  # OK
literal_string += literal_string2  # OK
literal_string += plain_str # Not OK

Format strings using literal strings:

literal_name: Literal[str]
expect_literal_str(f"hello {literal_name}")
# OK because it is composed from literal strings.

expect_literal_str("hello {}".format(literal_name))  # OK

expect_literal_str(f"hello")  # OK

expect_literal_str(f"hello {username}")
# NOT OK. The format-string is constructed from ``username``,
# which has type ``str``.

expect_literal_str("hello {}".format(username))  # Not OK

Other literal types, such as literal integers, are not compatible with Literal[str]:

some_int: int
expect_literal_str(some_int)  # Error: Expected Literal[str], got int.

literal_one: Literal[1] = 1
expect_literal_str(literal_one)  # Error: Expected Literal[str], got Literal[1].

We can call functions on literal strings:

def add_limit(query: Literal[str]) -> Literal[str]:
    return query + " LIMIT = 1"

def my_query(query: Literal[str], user_id: str) -> None:
    sql_connection().execute(add_limit(query), (user_id,))  # OK

Conditional statements and expressions work as expected:

def return_literal_str() -> Literal[str]:
    return "foo" if condition1() else "bar"  # OK

def return_literal_str2(literal_str: Literal[str]) -> Literal[str]:
    return "foo" if condition1() else literal_str  # OK

def return_literal_str3() -> Literal[str]:
    if condition1():
        result: Literal["foo"] = "foo"
    else:
        result: Literal[str] = "bar"

    return result  # OK

Interaction with TypeVars and Generics

TypeVars can be bound to Literal[str]:

from typing import Literal, TypeVar

TLiteral = TypeVar("TLiteral", bound=Literal[str])

def literal_identity(s: TLiteral) -> TLiteral:
    return s

hello: Literal["hello"] = "hello"
y = literal_identity(hello)
reveal_type(y)  # => Literal["hello"]

s: Literal[str]
y2 = literal_identity(s)
reveal_type(y2)  # => Literal[str]

s_error: str
literal_identity(s_error)
# Error: Expected TLiteral (bound to Literal[str]), got str.

Literal[str] can be used as type arguments for generic classes:

class Container(Generic[T]):
    def __init__(self, value: T) -> None:
        self.value = value

literal_str: Literal[str] = "hello"
x: Container[Literal[str]] = Container(literal_str)  # OK

s: str
x_error: Container[Literal[str]] = Container(s)  # Not OK

Standard containers like List work as expected:

xs: List[Literal[str]] = ["foo", "bar", "baz"]

Interactions with Overloads

Literal strings and overloads do not need to interact in a special way: the existing rules work fine. Literal[str] can be used as a fallback overload where a specific Literal["foo"] type does not match:

@overload
def foo(x: Literal["foo"]) -> int: ...
@overload
def foo(x: Literal[str]) -> bool: ...
@overload
def foo(x: str) -> str: ...

x1: int = foo("foo")  # First overload.
x2: bool = foo("bar")  # Second overload.
s: str
x3: str = foo(s)  # Third overload.

Backwards Compatibility

Literal[str] is acceptable at runtime, so this doesn't require any changes to the Python runtime itself. PEP 586 already backports Literal, so this PEP does not need to change it.

As PEP 586 mentions, type checkers "should feel free to experiment with more sophisticated inference techniques". So, if the type checker infers a literal string type for an unannotated variable that is initialized with a literal string, the following example should be OK:

x = "hello"
expect_literal_str(x)
# OK, because x is inferred to have type ``Literal["hello"]``.

This enables precise type checking of idiomatic SQL query code without annotating the code at all (as seen in the Motivation section example).

However, like PEP 586, this PEP does not mandate the above inference strategy. In case the type checker doesn't infer x to have type Literal["hello"], users can aid the type checker by explicitly annotating it as x: Literal[str]:

x: Literal[str] = "hello"
expect_literal_str(x)

Runtime Behavior

This PEP does not change the runtime behavior of Literal.

Rejected Alternatives

Why not use tool X?

Focusing solely on the example of preventing SQL injection, tooling to catch this kind of issue seems to come in three flavors: AST based, function level analysis, and taint flow analysis.

AST based tools include Bandit: Bandit has a plugin to warn when SQL queries are not literal strings. The problem is that many perfectly safe SQL queries are dynamically built out of string literals, as shown in the Motivation section. At the AST level, the resultant SQL query is not going to appear as a string literal anymore and is thus indistinguishable from a potentially malicious string. To use these tools would require significantly restricting developers' ability to build SQL queries. Literal[str] can provide similar safety guarantees with fewer restrictions.

Semgrep and pyanalyze: Semgrep supports a more sophisticated function level analysis, including constant propagation within a function. This allows us to prevent injection attacks while permitting some forms of safe dynamic SQL queries within a function. pyanalyze has a similar extension. But neither handles function calls that construct and return safe SQL queries. For example, in the code sample below, build_insert_query is a helper function to create a query that inserts multiple values into the corresponding columns. Semgrep and pyanalyze forbid this natural usage whereas Literal[str] handles it with no burden on the programmer:

def build_insert_query(
    table: Literal[str]
    insert_columns: Iterable[Literal[str]],
) -> Literal[str]:
    sql = "INSERT INTO " + table

    column_clause = ", ".join(insert_columns)
    value_clause = ", ".join(["?"] * len(insert_columns))

    sql += f" ({column_clause}) VALUES ({value_clause})"
    return sql

def insert_data(
    conn: Connection,
    kvs_to_insert: Dict[Literal[str], str]
) -> None:
    query = build_insert_query("data", kvs_to_insert.keys())
    conn.execute(query, kvs_to_insert.values())

# Example usage
data_to_insert = {
    "column_1": value_1, # Note: values are not literals
    "column_2": value_2,
    "column_3": value_3,
}
insert_data(conn, data_to_insert)

Taint flow analysis: Tools such as Pysa or CodeQL are capable of tracking data flowing from a user controlled input into a SQL query. These tools are powerful but involve considerable overhead in setting up the tool in CI, defining "taint" sinks and sources, and teaching developers how to use them. They also usually take longer to run than a type checker (minutes instead of seconds), which means feedback is not immediate. Finally, they move the burden of preventing vulnerabilities on to library users instead of allowing the libraries themselves to specify precisely how their APIs must be called (as is possible with Literal[str]).

Why not use a NewType for str?

Any API for which Literal[str] would be suitable could instead be updated to accept a different type created within the Python type system, such as NewType("SafeSQL", str):

SafeSQL = NewType("SafeSQL", str)


def execute(self, sql: SafeSQL, parameters: Iterable[str] = ...) -> Cursor: ...

execute(SafeSQL("SELECT * FROM data WHERE user_id = ?"), user_id)  # OK

user_query: str
execute(user_query)  # Error: Expected SafeSQL, got str.

Having to create a new type to call an API might give some developers pause and encourage more caution, but it doesn't guarantee that developers won't just turn a user controlled string into the new type, and pass it into the modified API anyway:

query = f"SELECT * FROM data WHERE user_id = f{user_id}"
execute(SafeSQL(query))  # No error!

We are back to square one with the problem of preventing arbitrary inputs to SafeSQL. This is not a theoretical concern either. Django uses the above approach with SafeString and mark_safe. Issues such as CVE-2020-13596 show how this technique can fail.

Also note that this requires invasive changes to the source code (wrapping the query with SafeSQL) whereas Literal[str] requires no such changes. Users can remain oblivious to it as long as they pass in literal strings to sensitive APIs.

Why not try to emulate Trusted Types?

Trusted Types is a W3C specification for preventing DOM-based Cross Site Scripting (XSS). XSS occurs when dangerous browser APIs accept raw user-controlled strings. The specification modifies these APIs to accept only the "Trusted Types" returned by designated sanitizing functions. These sanitizing functions must take in a potentially malicious string and validate it or render it benign somehow, for example by verifying that it is a valid URL or HTML-encoding it.

It can be tempting to assume porting the concept of Trusted Types to Python could solve the problem. The fundamental difference, however, is that the output of a Trusted Types sanitizer is usually intended to not be executable code. Thus it's easy to HTML encode the input, strip out dangerous tags, or otherwise render it inert. With a SQL query or shell command, the end result still needs to be executable code. There is no way to write a sanitizer that can reliably figure out which parts of an input string are benign and which ones are potentially malicious.

Runtime Checkable Literal[str]

The Literal[str] concept could be extended beyond static type checking to be a runtime checkable property of str objects. This would provide some benefits, such as allowing frameworks to raise errors on dynamic strings. Such runtime errors would be a more robust defense mechanism than type errors, which can potentially be suppressed, ignored, or never even seen if the author does not use a type checker.

This extension to the Literal[str] concept would dramatically increase the scope of the proposal by requiring changes to one of the most fundamental types in Python. While runtime taint checking on strings has been considered and attempted in the past, and others may consider it in the future, such extensions are out of scope for this PEP.

Reference Implementation

This is implemented in Pyre v0.9.8 and is actively being used.

The implementation simply extends the type checker with Literal[str] as a supertype of literal string types.

To support composition via addition, join, etc., it was sufficient to overload the stubs for str in Pyre's copy of typeshed. For example, we replaced str __add__:

# Before:
def __add__(self, s: str) -> str: ...

# After:
@overload
def __add__(self: Literal[str], other: Literal[str]) -> Literal[str]: ...
@overload
def __add__(self, other: str) -> str: ...

This means that addition of non-literal string types remains to have type str. The only change is that addition of literal string types now produces Literal[str].

One implementation strategy is to update the official Typeshed stub for str with these changes.

Appendix A: Other Uses

To simplify the discussion and require minimal security knowledge, we focused on SQL injections throughout the PEP. Literal[str], however, can also be used to prevent many other kinds of injection vulnerabilities.

Command Injection

APIs such as subprocess.run accept a string which can be run as a shell command:

subprocess.run(f"echo 'Hello {name}'", shell=True)

If attacker controlled data is included in the command string, a command injection vulnerability exists and malicious operations can be run. For example, a value of ' && rm -rf / # would result in the following destructive command being run:

echo 'Hello ' && rm -rf / #'

This vulnerability could be prevented by updating run to only accept Literal[str] when used in shell=True mode. Here is one simplified stub:

def run(command: Literal[str], *args: str, shell: bool=...): ...

Cross Site Scripting (XSS)

Most popular Python web frameworks, such as Django, use a templating engine to produce HTML from user data. These templating languages auto-escape user data before inserting it into the HTML template and thus prevent cross site scripting (XSS) vulnerabilities.

But a common way to bypass auto-escaping and render HTML as-is is to use functions like mark_safe in Django or do_mark_safe in Jinja2, which cause XSS vulnerabilities:

dangerous_string = django.utils.safestring.mark_safe(f"<script>{user_input}</script>")
return(dangerous_string)

This vulnerability could be prevented by updating mark_safe to only accept Literal[str]:

def mark_safe(s: Literal[str]) -> str: ...

Server Side Template Injection (SSTI)

Templating frameworks such as Jinja allow Python expressions which will be evaluated and substituted into the rendered result:

template_str = "There are {{ len(values) }} values: {{ values }}"
template = jinja2.Template(template_str)
template.render(values=[1, 2])
# Result: "There are 2 values: [1, 2]"

If an attacker controls all or part of the template string, they can insert expressions which execute arbitrary code and compromise the application:

malicious_str = "{{''.__class__.__base__.__subclasses__()[408]('rm - rf /',shell=True)}}"
template = jinja2.Template(malicious_str)
template.render()
# Result: The shell command 'rm - rf /' is run

Template injection exploits like this could be prevented by updating the Template API to only accept Literal[str]:

class Template:
    def __init__(self, source: Literal[str]): ...

Appendix B: Limitations

There are a number of ways Literal[str] could still fail to prevent users from passing strings built from non-literal data to an API:

1. If the developer does not use a type checker or does not add type annotations, then violations will go uncaught.

2. cast(Literal[str], non_literal_str) could be used to lie to the type checker and allow a dynamic string value to masquerade as a Literal[str]. The same goes for a variable that has type Any.

3. Comments such as # type: ignore could be used to ignore warnings about non-literal strings.

4. Trivial functions could be constructed to convert a str to a Literal[str]:

def make_literal(s: str) -> Literal[str]:
    letters: Dict[str, Literal[str]] = {
        "A": "A",
        "B": "B",
        ...
    }
    output: List[Literal[str]] = [letters[c] for c in s]
    return "".join(output)

We could mitigate the above using linting, code review, etc., but ultimately a clever, malicious developer attempting to circumvent the protections offered by Literal[str] will always succeed. The important thing to remember is that Literal[str] is not intended to protect against malicious developers; it is meant to protect against benign developers accidentally using sensitive APIs in a dangerous way (without getting in their way otherwise).

Without Literal[str], the best enforcement tool API authors have is documentation, which is easily ignored and often not seen. With Literal[str], API misuse requires conscious thought and artifacts in the code that reviewers and future developers can notice.

Resources

Literal String Types in Scala

Scala uses Singleton as the supertype for singleton types, which includes literal string types such as "foo". Singleton is Scala's generalized analogue of this PEP's Literal[str].

Tamer Abdulradi showed how Scala's literal string types can be used for "Preventing SQL injection at compile time", Scala Days talk Literal types: What are they good for? (slides 52 to 68).

Thanks

Thanks to the following people for their feedback on the PEP:

Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев, and Shengye Wan

Source: https://github.com/python/peps/blob/master/pep-0675.rst