[Pandas-dev] Challenges in creating public pandas typing stubs

Irv Lustig irv at princeton.com
Thu Apr 7 17:28:12 EDT 2022


All:

Apologies in advance for the long email.  I think we should have a
discussion on this topic at the next pandas dev meeting on April 13 at 2PM
Eastern time.

So there is good news and bad news.  The good news is the following:

We're now at a point where the tests that came from the pandas-stubs
project at
https://github.com/VirtusLab/pandas-stubs/tree/master/tests/snippets have
been carried over and modified to test the Microsoft typing stubs at
https://github.com/microsoft/python-type-stubs/tree/main/partial/pandas.
The tests now live in the Microsoft project at
https://github.com/microsoft/python-type-stubs/tree/main/tests/pandas, with
CI set up to test those stubs using pyright, mypy, and pytest.

Those stubs are now in pylance 2022.4.0, which was released yesterday.
I've been using some code from our projects at my company to find where
things were missing in those stubs, adding to the stubs and creating
appropriate tests to get to the current version.  I'm sure they are not
complete with respect to all the pandas methods, but in my opinion they
cover a lot of typical use cases.

So now the bad news....

The problem that I'm facing is how to migrate the work done there over to
the pandas project.  I thought this would be easy to do in some incremental
fashion, but I've been unable to figure out a way to do that.  The issues
are as follows:
1) Any types in the PYI files have to match what is in the source code.
For the MS stubs, this is sometimes not the case.  (See below for an
example)
2) mypy will first look in the PYI files for typing, but when typing
doesn't exist there, it will look in the source code.  In places, the PYI
files declare types for classes and methods that are not typed in the
source code, and this inconsistency creates a huge number of mypy failures.
3) The MS stubs make the Series class generic.  Users don't have to use
that, but it enables some nice features: a type checker can figure out,
for example, that
`Series[Timestamp].__sub__(Timestamp) -> Series[Timedelta]`.  We could
decide to remove that, although I have found it to be useful in my
company's projects.
4) pandas/_typing.py in the source code and pandas/_typing.pyi from the
stubs have some differences, since they evolved differently over time.
They could probably be made consistent, but they are used in a different
way for "internal" typing checks and "public" typing checks.
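
To illustrate point 3, here is a minimal sketch of how a generic class can
let a type checker infer the element type of a subtraction.  `ToySeries` is
a hypothetical stand-in written for this illustration, using `datetime` and
`timedelta` from the standard library in place of `Timestamp` and
`Timedelta`; it is not the pandas class and not the actual MS stub code.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Generic, TypeVar, overload

T = TypeVar("T")


class ToySeries(Generic[T]):
    """Hypothetical stand-in for a generic Series; not the pandas class."""

    def __init__(self, values: list[T]) -> None:
        self.values = values

    # Overloads on the type of `self` let a checker choose the result's
    # element type from the operand types.
    @overload
    def __sub__(self: ToySeries[datetime], other: datetime) -> ToySeries[timedelta]: ...
    @overload
    def __sub__(self: ToySeries[float], other: float) -> ToySeries[float]: ...
    def __sub__(self, other):
        # Single runtime implementation shared by both overloads.
        return ToySeries([v - other for v in self.values])


stamps = ToySeries([datetime(2022, 4, 7), datetime(2022, 4, 13)])
# A type checker would infer ToySeries[timedelta] for `deltas`.
deltas = stamps - datetime(2022, 4, 7)
```

Code that never writes `ToySeries[...]` explicitly still works unchanged,
which mirrors the remark that users don't have to use the generic form.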

As an example of the type matching, consider the methods `DataFrame.any()`
and `Series.any()`.  Depending on the parameter `level`, we know that
`DataFrame.any()` will return either a `DataFrame` or a `Series`, and that
`Series.any()` will return either a `Series` or a scalar.  In the code,
`DataFrame.any()` and `Series.any()` share the same declaration and
implementation in `generic.py` via `NDFrame.any()`.  To give users the
proper return types in the MS stubs, we placed overloads for `any()` in
frame.pyi and series.pyi.  That's a mismatch with the implementation.
There are probably a lot more examples like this.
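
A rough sketch of that mismatch, using toy classes rather than pandas
itself: the base class holds the one shared implementation, while each
subclass declares stub-style overloads that the base does not have.  The
names `ToyNDFrame`, `ToySeries`, `ToyFrame`, `_collapse` and `_keep` are
all hypothetical, and the toy does no real grouping by `level`.

```python
from __future__ import annotations

from typing import overload


class ToyNDFrame:
    """Hypothetical shared base, standing in for NDFrame."""

    def any(self, level=None):
        # One shared implementation: the kind of result depends on the
        # calling class and on whether `level` is given.
        return self._collapse() if level is None else self._keep()


class ToySeries(ToyNDFrame):
    def __init__(self, values: list[bool]) -> None:
        self.values = values

    def _collapse(self) -> bool:
        return any(self.values)

    def _keep(self) -> ToySeries:
        return ToySeries(list(self.values))  # toy: no real grouping

    # Stub-style overloads: precise for users, absent from the base class.
    @overload
    def any(self, level: None = ...) -> bool: ...
    @overload
    def any(self, level: int) -> ToySeries: ...
    def any(self, level=None):
        return super().any(level)


class ToyFrame(ToyNDFrame):
    def __init__(self, columns: dict[str, list[bool]]) -> None:
        self.columns = columns

    def _collapse(self) -> ToySeries:
        return ToySeries([any(col) for col in self.columns.values()])

    def _keep(self) -> ToyFrame:
        return ToyFrame(dict(self.columns))  # toy: no real grouping

    @overload
    def any(self, level: None = ...) -> ToySeries: ...
    @overload
    def any(self, level: int) -> ToyFrame: ...
    def any(self, level=None):
        return super().any(level)


series = ToySeries([False, True])
frame = ToyFrame({"a": [False, False], "b": [False, True]})
```

In a real stub package only the overload signatures would appear in the
PYI files, so the per-class declarations would have no counterpart in the
shared source method.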

Another example is `DataFrame.__getitem__()`, which is not possible to
statically type.  If you pass a string and the underlying DataFrame has
duplicate column names corresponding to that string, you get a DataFrame
as a result, but if the column is uniquely named within the DataFrame, you
get a Series.  Asking users to always use `cast` to convert the result of
`df["abc"]` would make the typing stubs unfriendly and not very useful.
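
The sketch below mimics that behavior with toy classes (hypothetical names,
not pandas code).  It shows one possible choice, assumed here only for
illustration: annotate `__getitem__` for the common unique-column case and
accept that the duplicate-column case is not described by the annotation.

```python
from __future__ import annotations


class ToySeries:
    def __init__(self, values: list[int]) -> None:
        self.values = values


class ToyFrame:
    """Hypothetical frame that allows duplicate column names."""

    def __init__(self, names: list[str], columns: list[list[int]]) -> None:
        self.names = names
        self.columns = columns

    # Annotated for the common case only (an assumption of this sketch).
    def __getitem__(self, key: str) -> ToySeries:
        matches = [c for n, c in zip(self.names, self.columns) if n == key]
        if not matches:
            raise KeyError(key)
        if len(matches) == 1:
            return ToySeries(matches[0])
        # Duplicate names: the runtime result is a frame, which the
        # annotation above deliberately does not describe.
        return ToyFrame([key] * len(matches), matches)  # type: ignore[return-value]


unique = ToyFrame(["abc", "x"], [[1, 2], [3, 4]])
dupes = ToyFrame(["abc", "abc"], [[1, 2], [3, 4]])

col = unique["abc"]   # a checker sees ToySeries, no cast needed
both = dupes["abc"]   # a checker still sees ToySeries; runtime gives ToyFrame
```

With this choice, users with uniquely named columns never need `cast`,
while code that relies on duplicate column names falls outside what the
annotation promises.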

So how do we move forward?  To be honest, I'm not sure, which is why we
should discuss this.  Some ideas that I have are:
A) Let's not manage the public facing stubs as part of the pandas project,
and have a separate pandas-stubs project that we manage, using the MS stubs
as a starting point.  These represent the "public" API, are separately
type-checked from the source code, and can evolve separately from the
regular development code.  They can also represent the most common ways
that people use the pandas API, essentially defining a statically typed API
representing the most common use cases.  If people want to use mypy or
pyright or any other type checker, then they just install that package and
get typing support.
B) Move all type declarations out of the "py" files into "pyi" files.  I
think this is what numpy did (e.g., see numpy/core/numeric.py and
numpy/core/numeric.pyi).  The advantage here is that we then don't have to
worry about typing issues in the python code, just in the PYI files, and
that could serve as a new basis for stubs for users.  But that doesn't
solve the issue of things like `NDFrame.any()` described above.  Having
all type declarations appear only in PYI files could be an advantage for
our code maintenance anyway.
C) Create a "new" public API that lives in `pandas.api.typing`, and if you
want to use typing, you do `import pandas.api.typing as pd`, then use
`pd.Series` and `pd.DataFrame`, etc., which act as a set of wrappers
around the current implementation.  So if you want to have type checking,
you use the same code as you do today, but just change what is imported as
"pd" to point to the typed API.

There may be some other alternatives.  There may also be some way to
migrate the MS stubs over, but I don't really have that much time to figure
that out.

Fundamentally, pandas uses a lot of dynamic typing under the hood to make
it work.  We then have been incrementally adding type declarations, making
them as precise as possible (not too narrow, not too wide), to support
development of the source code.  But I think that to support users of
pandas, we need to come up with a statically typed API, and just punt on
the cases that correspond to unusual usage.  I like the numpy strategy
where they write:

"NumPy is very flexible. Trying to describe the full range of possibilities
statically would result in types that are not very helpful. For that
reason, the typed NumPy API is often stricter than the runtime NumPy API."

I think we need to keep this philosophy in mind as we make a decision as to
what's right for pandas.


@Dr-Irv