Notice: While Javascript is not essential for this website, your interaction with the content will be limited. Please turn Javascript on for the full experience.

PEP 618 -- Add Optional Length-Checking To zip

PEP:618
Title:Add Optional Length-Checking To zip
Author:Brandt Bucher <brandtbucher at gmail.com>
Sponsor:Antoine Pitrou <antoine at python.org>
Status:Draft
Type:Standards Track
Created:01-May-2020
Python-Version:3.9
Post-History:01-May-2020
Resolution:

Abstract

This PEP proposes adding an optional strict boolean keyword parameter to the built-in zip. When enabled, a ValueError is raised if one of the arguments is exhausted before the others.

Motivation

Many Python users find that most of their zip usage involves iterables that should be of equal length. Sometimes this invariant is proven true from the context of the surrounding code, but often the data being zipped is passed from the caller, sourced separately, or generated in some fashion. In any of these cases, the default behavior of zip means that faulty refactoring or logic errors could easily result in silently losing data. These bugs are not only difficult to diagnose, but difficult to even detect at all.

It is easy to come up with simple cases where this could be a problem. For example, the following code may work fine when items is a sequence, but silently start producing shortened, mismatched results if items is refactored by the caller to be a consumable iterator:

def apply_calculations(items):
    transformed = transform(items)
    for x, y in zip(items, transformed):
        yield something(x, y)

There are several other ways in which zip is commonly used. Idiomatic tricks are especially susceptible, because they are often employed by users who lack a complete understanding of how the code works. One example is unpacking into zip to lazily "unzip" or "transpose" nested iterables:

>>> x = iter([iter([1, 2, 3]), iter(["one" "two" "three"])])
>>> xt = list(zip(*x))

Another is "chunking" data into equal-sized groups:

>>> n = 3
>>> x = range(n ** 2),
>>> xn = list(zip(*[iter(x)] * n))

In the first case, non-rectangular data is usually a logic error. In the second case, data with a length that is not a multiple of n is often an error as well. However, both of these idioms will silently omit the tail-end items of malformed input.

Perhaps most convincingly, the current use of zip in the standard-library ast module has created multiple bugs that silently drop parts of malformed nodes:

>>> from ast import Constant, Dict, literal_eval
>>> nasty_dict = Dict(keys=[Constant(None)], values=[])
>>> literal_eval(nasty_dict)  # Like eval("{None: }")
{}

In fact, the author has counted dozens of other call sites in Python's standard library and tooling where it would be appropriate to enable this new feature immediately.

Rationale

Some critics assert that constant boolean switches are a "code-smell", or go against Python's design philosophy. However, Python currently contains several examples of boolean keyword parameters on built-in functions which are typically called with compile-time constants:

  • compile(..., dont_inherit=True)
  • open(..., closefd=False)
  • print(..., flush=True)
  • sorted(..., reverse=True)

Many more exist in the standard library.

A good rule of thumb is that "mode-switches" which change return types or significantly alter functionality are indeed an anti-pattern, while ones which enable or disable complementary checks or behavior are not.

The idea and name for this new parameter were originally proposed by Ram Rachum. The thread received over 100 replies, with the alternative "equal" receiving a similar amount of support.

The author does not have a strong preference between the two choices, though "equal equals" is a bit awkward in prose. It may also (wrongly) imply some notion of "equality" between the zipped items:

>>> z = zip([2.0, 4.0, 6.0], [2, 4, 8], equal=True)

Specification

When the built-in zip is called with the keyword-only argument strict=True, the resulting iterator will raise a ValueError if the arguments are exhausted at differing lengths. This error will occur at the point when iteration would normally stop today.

At most one additional item may be consumed from one of the iterators when compared to normal zip usage.

Backward Compatibility

This change is fully backward-compatible.

Reference Implementation

The author has drafted a C implementation.

An approximate pure-Python translation is:

def zip(*iterables, strict=False):
    if not iterables:
        return
    iterators = tuple(iter(iterable) for iterable in iterables)
    try:
        while True:
            items = []
            for iterator in iterators:
                items.append(next(iterator))
            yield tuple(items)
    except StopIteration:
        if not strict:
            return
    if items:
        i = len(items) + 1
        raise ValueError(f"zip() argument {i} is too short")
    sentinel = object()
    for i, iterator in enumerate(iterators[1:], 2):
        if next(iterator, sentinel) is not sentinel:
            raise ValueError(f"zip() argument {i} is too long")

Rejected Ideas

Add Additional Flavors Of zip To itertools

Adding zip_strict to itertools is a larger change with greater maintenance burden than the simple modification being proposed.

It seems that a great deal of the motivation driving this alternative is that zip_longest already exists in itertools. However, zip_longest is really another beast entirely: it takes on the responsibility of filling in missing values, a problem neither of the other variants even have. It also arguably has the most specialized behavior of the three (to the point of exposing a new fillvalue parameter), so it makes sense that it would live in itertools while zip grows in-place.

Importing a drop-in replacement for a built-in also feels too heavy, especially just to check a tricky condition that should "always" be true. The goal here is not just to provide a way to catch bugs, but to also make it easy (even tempting) for a user to enable the check whenever using zip at a call site with this property.

Some have also argued that a new function buried in the standard library is somehow more "discoverable" than a keyword parameter on the built-in itself. The author does not believe this to be true.

Another proposed idiom, per-module shadowing of the built-in zip with some subtly different variant from itertools, is an anti-pattern that shouldn't be encouraged.

Add Several "Modes" To Switch Between

This option only makes more sense than a binary flag if we anticipate having three or more modes. The "obvious" three choices for these enumerated or constant modes would be "shortest" (the current zip behavior), "strict" (the proposed behavior), and "longest" (the itertools.zip_longest behavior).

However, it doesn't seem like adding behaviors other than the current default and the proposed "strict" mode is worth the additional complexity. The clearest candidate, "longest", would require a new fillvalue parameter (which is meaningless for both other modes). This mode is also already handled perfectly by itertools.zip_longest, and adding it would create two ways of doing the same thing. It's not clear which would be the "obvious" choice: the mode parameter on the built-in zip, or the long-lived namesake utility in itertools.

Add A Method Or Alternate Constructor To The zip Type

Consider the following two options, which have both been proposed:

>>> zm = zip(*iters).strict()
>>> zd = zip.strict(*iters)

It's not obvious which one will succeed, or how the other will fail. If zip.strict is implemented as a method, zm will succeed, but zd will fail in one of several confusing ways:

  • Yield results that aren't wrapped in a tuple (if iters contains just one item, a zip iterator).
  • Raise a TypeError for an incorrect argument type (if iters contains just one item, not a zip iterator).
  • Raise a TypeError for an incorrect number of arguments (otherwise).

If zip.strict is implemented as a classmethod or staticmethod, zd will succeed, and zm will silently yield nothing (which is the problem we are trying to avoid in the first place).

This proposal is further complicated by the fact that CPython's actual zip type is an undocumented implementation detail.

Change The Default Behavior Of zip

There is nothing "wrong" with the default behavior of zip, since there are many cases where it is indeed the correct way to handle unequally-sized inputs. It's extremely useful, for example, when dealing with infinite iterators.

itertools.zip_longest already exists to service those cases where the "extra" tail-end data is still needed.

Accept A Callback To Handle Remaining Items

While able to do basically anything a user could need, this solution makes handling the more common cases (like rejecting mismatched lengths) unnecessarily complicated and non-obvious.

Raise An AssertionError Instead Of A ValueError

There are no built-in functions or types that raise an AssertionError as part of their API. Further, the official documentation simply reads (in its entirety):

Raised when an assert statement fails.

Since this feature has nothing to do with Python's assert statement, raising an AssertionError here would be inappropriate. Users desiring a check that is disabled in optimized mode (like an assert statement) can use strict=__debug__ instead.

Add A Similar Feature to map

This PEP does not propose any changes to map, since the use of map with multiple iterable arguments is quite rare. However, this PEP's ruling shall serve as precedent such a future discussion (should it occur).

If rejected, the feature is realistically not worth pursuing. If accepted, such a change to map should not require its own PEP (though, like all enhancements, its usefulness should be carefully considered). For consistency, it should follow same API and semantics debated here for zip.

Do Nothing

This option is perhaps the least attractive.

Silently truncated data is a particularly nasty class of bug, and hand-writing a robust solution that gets this right isn't trivial. The real-world motivating examples from Python's own standard library are evidence that it's very easy to fall into the sort of trap that this feature aims to avoid.

Source: https://github.com/python/peps/blob/master/pep-0618.rst