[Python-ideas] Type hints for text/binary data in Python 2+3 code

Andrey Vlasovskikh andrey.vlasovskikh at gmail.com
Thu Mar 24 20:00:37 EDT 2016


Upon further investigation of the problem I've come up with an alternative idea that looks simpler and yet is still capable of finding most text/binary conversion errors.

Here is a rendered Markdown version: https://gist.github.com/vlasovskikh/1a8d5effe95d5944b919


## TL;DR

* Introduce `typing.Text` for text data in Python 2+3
* `bytes`, `str`, `unicode`, `typing.Text` in type hints mean whatever they
  mean at runtime for Python 2 or 3
* Allow `str -> unicode` and `unicode -> str` promotions for Python 2
* Type checking for Python 2 *and* Python 3 actually finds most text/binary
  errors
* A few false negatives for Python 2 are not worth special handling besides
  possible ad-hoc handling of non-ASCII literal conversions


## Summary for Python users

If you want your code to be Python 2+3 compatible:

* Write text/binary type hints in 2+3 compatible comments
    * Use `typing.Text` for text data, `bytes` for binary data
    * Use `str` only for rare cases of "native strings"
    * Don't use `unicode` since it's absent in Python 3
* Run a type checker for *both* Python 2 and Python 3
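
A minimal sketch of such 2+3 compatible type comments, assuming `typing.Text`
is available as proposed here. The function names `encode_name` and
`decode_name` are made up for illustration:

```python
from typing import Text


def encode_name(name):
    # type: (Text) -> bytes
    # Text in, binary out: the conversion is explicit, never implicit.
    return name.encode('utf-8')


def decode_name(data):
    # type: (bytes) -> Text
    # Binary in, text out, again with an explicit encoding.
    return data.decode('utf-8')
```

Both signatures mean the same thing to a type checker in Python 2 mode and in
Python 3 mode, which is the point of preferring `Text` and `bytes` over `str`.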


## Summary for authors of type checkers

The semantics of types `bytes`, `str`, `unicode`, `typing.Text` and the type
checking rules for them should match the *runtime behavior* of these types in
Python 2 and Python 3, depending on the mode the checker runs in. Using the
runtime semantics for the types is easy to understand and still catches most
errors. The warnings in the Python 2+3 compatibility mode are just the sum of
the Python 2 and Python 3 warnings.

Type checkers *should* promote `str`/`bytes` to `unicode`/`Text` and
`unicode`/`Text` to `str`/`bytes` for Python 2. Most text/binary conversion
errors can be found by running a type checker for Python 2 *and* for Python 3.


## typing.Text: Python 2+3 compatible type for text data

The `typing.Text` type is a Python 2+3 compatible type for text data. It's
defined as follows:

    if sys.version_info < (3,):
        Text = unicode
    else:
        Text = str

For a Python 2+3 compatible type for binary data use `bytes`, which is
available in both 2 and 3.


## Implicit text/binary conversions

In Python 2 text data is implicitly converted to binary data and vice versa
using the ASCII encoding. A `UnicodeEncodeError` or a `UnicodeDecodeError` is
raised only if the data isn't ASCII-compatible. As a result, many programs
aren't well-tested regarding non-ASCII data handling.

In Python 3 there are no implicit conversions: mixing text and binary data,
for example in concatenation, raises a `TypeError`.

A type checker run in the Python 3 mode will find most of the Python 2
implicit conversion errors.
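
The difference can be sketched as follows. The helper `py2_style_concat` is
hypothetical: it spells out the ASCII decoding that Python 2 performs
implicitly for `text + data`, so the behavior can be reproduced on Python 3 as
well. `outcome` is also just an illustrative helper:

```python
from typing import Text


def py2_style_concat(text, data):
    # type: (Text, bytes) -> Text
    # Explicit version of what Python 2 does implicitly for text + data:
    # decode the binary operand as ASCII, then concatenate.
    return text + data.decode('ascii')


def outcome(func, *args):
    # Return the name of the exception raised by func(*args), or 'ok'.
    try:
        func(*args)
    except Exception as e:
        return type(e).__name__
    return 'ok'


ascii_data = b'def'
non_ascii_data = u'привет'.encode('utf-8')

print(outcome(py2_style_concat, u'abc', ascii_data))      # ok
print(outcome(py2_style_concat, u'abc', non_ascii_data))  # UnicodeDecodeError
```

The ASCII case passes silently, which is why such code tends to go untested
for non-ASCII input. On Python 3 the unconverted expression `u'abc' + b'def'`
raises a `TypeError` regardless of the contents.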


## Checking for Python 2+3 compatibility

In order to be Python 2+3 compatible a program has to pass *both* Python 2 and
Python 3 type checking. In other words, the warnings found in the Python 2+3
compatible mode are a simple sum of Python 2 warnings and Python 3 warnings.


## Runtime type compatibility

Here is a table of types whose values are compatible at runtime. Columns are
the expected types, rows are the actual types:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    |  . .  |  * F  |  * .  |  . F
    bytes   |  * F  |  . .  |  . F  |  * F
    str     |  * .  |  . F  |  . .  |  * F
    unicode |  . F  |  * F  |  * F  |  . F

Each cell contains two characters: the result in Python 2 and in Python 3
respectively. Abbreviations:

* `.` — types are compatible
* `F` — types are not compatible
* `*` — types are compatible, ignoring implicit ASCII conversions

At runtime in Python 2 `str` is compatible with `unicode` and vice versa
(ignoring possible implicit ASCII conversion errors).

Using `unicode` in Python 3 is always an error since there is no `unicode` name
in Python 3.

As you can see from the table above, many implicit ASCII conversion
errors in a Python 2 program can be found just by running a type checker in the
Python 3 mode.

The only conversions whose errors go unnoticed in both modes are `Text` to
`str` and vice versa in Python 2.

Example 1. `Text` to `str`

    from typing import Any

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    # False negative: no warning, but this fails for non-ASCII in Python 2
    foo(..., u'привет')

Example 2. `str` to `Text`

    from typing import Any, Text

    def foo(x):
        # type: (Text) -> Any
        return u'Привет, ' + x

    # False negative: no warning, but this fails for non-ASCII in Python 2
    foo('Мир')

For non-ASCII text literals passed to functions that expect `Text` or `str` in
Python 2 a type checker can analyze the contents of the literal and show
additional warnings based on this information. For non-ASCII data coming from
sources other than literals this check would be more complicated.
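
A rough sketch of such a literal check, assuming the checker already has the
literal's value at hand. The name `is_ascii_only` is made up for illustration
and is not part of any existing type checker:

```python
def is_ascii_only(literal):
    # Return True if the literal would survive Python 2's implicit ASCII
    # conversion, in either direction, without raising an error.
    try:
        if isinstance(literal, bytes):
            literal.decode('ascii')   # binary -> text
        else:
            literal.encode('ascii')   # text -> binary
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True


print(is_ascii_only(u'upper'))                  # True
print(is_ascii_only(u'привет'))                 # False
print(is_ascii_only(u'Мир'.encode('utf-8')))    # False
```

A checker could emit an extra warning whenever this returns `False` for a
literal passed where the other string kind is expected in Python 2.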

To summarize, with this type compatibility table in place, a type checker run
for *both* Python 2 and Python 3 is able to find *almost all errors* related to
text and binary data except for a few text to "native string" conversions and
vice versa in Python 2.
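
As a rough cross-check of this summary, the runtime compatibility table can be
encoded as data and queried. The names `parse_table`, `warnings_2and3` and
`unflagged_pairs` are hypothetical helpers, not an existing API:

```python
COLUMNS = ['Text', 'bytes', 'str', 'unicode']

# Rows are actual types, columns are expected types, as in the table above.
RUNTIME_TABLE = """
Text    |  . .  |  * F  |  * .  |  . F
bytes   |  * F  |  . .  |  . F  |  * F
str     |  * .  |  . F  |  . .  |  * F
unicode |  . F  |  * F  |  * F  |  . F
"""


def parse_table(text):
    # Map (actual, expected) to a (python2_mark, python3_mark) pair.
    table = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.split('|')]
        actual = cells[0]
        for expected, cell in zip(COLUMNS, cells[1:]):
            py2, py3 = cell.split()
            table[(actual, expected)] = (py2, py3)
    return table


RUNTIME = parse_table(RUNTIME_TABLE)


def warnings_2and3(actual, expected, table=RUNTIME):
    # The 2+3 result is the sum of the Python 2 and Python 3 warnings.
    py2, py3 = table[(actual, expected)]
    result = []
    if py2 == 'F':
        result.append('Python 2')
    if py3 == 'F':
        result.append('Python 3')
    return result


def unflagged_pairs(table):
    # Implicit ASCII conversions in Python 2 that Python 3 mode won't catch.
    return sorted(pair for pair, (py2, py3) in table.items()
                  if py2 == '*' and py3 != 'F')


print(warnings_2and3('bytes', 'Text'))  # ['Python 3']
print(unflagged_pairs(RUNTIME))         # [('Text', 'str'), ('str', 'Text')]
```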


## Current Mypy type compatibility (non-runtime semantics)

Mypy applies `str` to `unicode` promotion for Python 2, but it doesn't promote
`unicode` to `str`. Here is an example of a Python 2 program that is correct
given the runtime type compatibility semantics shown in the table above, but is
incorrect for Mypy:

    from typing import Any

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    foo('', u'upper')  # False positive warning in Mypy for ASCII in Python 2

Here is the type compatibility table for the current version of Mypy:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    |  . .  |  F F  |  F .  |  . F
    bytes   |  * F  |  . .  |  . F  |  * F
    str     |  * .  |  . F  |  . .  |  * F
    unicode |  . F  |  F F  |  F F  |  . F

Running the Mypy type checker in Python 2 mode *and* Python 3 mode for the same
program would find almost all implicit ASCII conversion errors except for `str`
to `Text` conversions.

To summarize, the current Mypy type compatibility table covers almost all text
and binary data handling errors when used for *both* Python 2 and Python 3. But
it doesn't notice errors in "native string" to text conversions in Python 2 and
produces *false warnings* for text to "native string" conversions in Python 2.
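
To see exactly where the two tables disagree, they can be compared cell by
cell. This is only a sketch over the tables as written in this message; the
name `differing_cells` is made up for illustration:

```python
COLUMNS = ['Text', 'bytes', 'str', 'unicode']

# Rows are actual types, columns are expected types.
RUNTIME_ROWS = {
    'Text':    ['. .', '* F', '* .', '. F'],
    'bytes':   ['* F', '. .', '. F', '* F'],
    'str':     ['* .', '. F', '. .', '* F'],
    'unicode': ['. F', '* F', '* F', '. F'],
}

MYPY_ROWS = {
    'Text':    ['. .', 'F F', 'F .', '. F'],
    'bytes':   ['* F', '. .', '. F', '* F'],
    'str':     ['* .', '. F', '. .', '* F'],
    'unicode': ['. F', 'F F', 'F F', '. F'],
}


def differing_cells(left, right):
    # Collect (actual, expected, left_cell, right_cell) where cells differ.
    diffs = []
    for actual in COLUMNS:
        for expected, a, b in zip(COLUMNS, left[actual], right[actual]):
            if a != b:
                diffs.append((actual, expected, a, b))
    return diffs


for diff in differing_cells(RUNTIME_ROWS, MYPY_ROWS):
    print(diff)
```

All four differing cells are in the `Text` and `unicode` rows, in the `bytes`
and `str` columns: the runtime table marks them `*` for Python 2, while the
Mypy table marks them `F`.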


-- 
Andrey Vlasovskikh

Web: http://pirx.ru/


