[Pandas-dev] Challenges in creating public pandas typing stubs

Wed Apr 13 17:42:50 EDT 2022

We discussed the issue at today's pandas development meeting.  Simon
Hawkins, Brock Mendel, Richard Shadrach, and I agreed that option A (having
a separate pandas-stubs project) would be the best way forward.  We also
wanted Jeff Reback's approval, so I asked him privately, and he agrees
with that approach.

Here's my proposal for a way forward:
1. Work with the Microsoft team to move what they have into a new
pandas-stubs repo that we'll create.  They would stop maintaining their
repo and, until ours is published on PyPI, modify their processes to pull
our repo into theirs for Pylance releases.
2. Work with the VirtusLab team to figure out how our pandas-stubs package
can replace theirs on PyPI.
3. Maybe someone else can help with getting pandas-stubs on conda-forge
and/or the main Anaconda channel once we get it on PyPI.

-Irv

On Thu, Apr 7, 2022 at 5:28 PM Irv Lustig <irv at princeton.com> wrote:

> All:
>
> Apologies in advance for the long email.  I think we should have a
> discussion on this topic at the next pandas dev meeting on April 13 at 2PM
> Eastern time.
>
> So there is good news and bad news.  The good news is the following:
>
> We're now at a point where the Microsoft typing stubs at
> https://github.com/microsoft/python-type-stubs/tree/main/partial/pandas
> have CI set up to test them using pyright, mypy, and pytest.  The tests
> came from the pandas-stubs project at
> https://github.com/VirtusLab/pandas-stubs/tree/master/tests/snippets and
> have been carried over and modified for use in the Microsoft project at
> https://github.com/microsoft/python-type-stubs/tree/main/tests/pandas.
>
> Those stubs are now in Pylance 2022.4.0, which was released yesterday, and
> I've been using some code from our projects at my company to help determine
> where things were missing in those stubs, adding to the stubs and creating
> appropriate tests to get to the current version.  I'm sure they are not
> complete with respect to all the pandas methods, but we are covering a lot
> of typical use cases, in my opinion.
>
> So now the bad news....
>
> The problem that I'm facing is how to migrate the work done there over to
> the pandas project.  I thought this would be easy to do in some incremental
> fashion, but I've been unable to figure out a way to do that.  The issues
> are as follows:
> 1) Any types in the PYI files have to match what is in the source code.
> For the MS stubs, this is sometimes not the case.  (See below for an
> example.)
> 2) mypy will first look in the PYI files for typing, but when typing
> doesn't exist there, it will look in the source code.  The PYI files
> declare types for classes and methods that are not typed in the source
> code, and that inconsistency creates a huge number of mypy failures.
> 3) The MS stubs make the Series class generic.  Users don't have to use
> that, but it enables some nice features, such as a type checker inferring
> that `Series[Timestamp].__sub__(Timestamp) -> Series[Timedelta]`.  We
> could decide to remove that, although I have found it to be useful in my
> company's projects.
> 4) pandas/_typing.py in the source code and pandas/_typing.pyi from the
> stubs have some differences, since they evolved differently over time.
> They could probably be made consistent, but they are used in a different
> way for "internal" typing checks and "public" typing checks.
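
To illustrate item 3, here is a minimal sketch of how a generic class with
overloads on `__sub__` lets a type checker track the element type.  This is
a toy `Series`, not the pandas class or the MS stubs, and the standard
library's `datetime` and `timedelta` stand in for `Timestamp` and
`Timedelta` (both substitutions are assumptions made for the example):

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Generic, TypeVar, overload

T = TypeVar("T")


class Series(Generic[T]):
    """Toy stand-in for a generic Series; not the pandas class."""

    def __init__(self, values: list[T]) -> None:
        self.values = values

    # A Series of datetimes minus a datetime gives a Series of timedeltas.
    @overload
    def __sub__(self: Series[datetime], other: datetime) -> Series[timedelta]: ...
    # A Series of floats minus a float stays a Series of floats.
    @overload
    def __sub__(self: Series[float], other: float) -> Series[float]: ...
    def __sub__(self, other):
        # Single runtime implementation shared by both overloads.
        return Series([v - other for v in self.values])


stamps = Series([datetime(2022, 4, 13), datetime(2022, 4, 7)])
# A type checker should infer Series[timedelta] for this expression.
deltas = stamps - datetime(2022, 4, 7)
```

Code that never writes `Series[...]` explicitly still type-checks; the
parameter only matters where an overload depends on it.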
>
> As an example of the type matching, consider the methods `DataFrame.any()`
> and `Series.any()`.  Depending on the parameter `level`, `DataFrame.any()`
> returns either a `DataFrame` or a `Series`, while `Series.any()` returns
> either a `Series` or a scalar.  In the code, `DataFrame.any()` and
> `Series.any()` share the same declaration and implementation in
> `generic.py` via `NDFrame.any()`.  To accomplish the proper return typing
> for users in the MS stubs, we placed overloads for `any()` in frame.pyi and
> series.pyi.  That's a mismatch to the implementation.  There are probably
> a lot more examples like this.
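
The mismatch can be sketched with toy classes.  Nothing here is pandas
code: `NDFrame` and `Series` are hypothetical stand-ins, the data is a
plain dict of group label to flags, and the `level` reduction is simplified
to "reduce each group".  The point is only that the shared implementation
lives on the base class while the precise, overloaded signatures live on
the subclass:

```python
from __future__ import annotations

from typing import Any, Optional, overload


class NDFrame:
    """Toy base class holding the one shared implementation."""

    def __init__(self, data: dict[str, list[bool]]) -> None:
        # Maps a group label (stand-in for an index level value) to flags.
        self.data = data

    def any(self, level: Optional[str] = None) -> Any:
        if level is None:
            # No level: reduce everything to one scalar.
            return any(flag for flags in self.data.values() for flag in flags)
        # With a level: reduce within each group, keep the container type.
        return type(self)({k: [any(v)] for k, v in self.data.items()})


class Series(NDFrame):
    """Toy subclass whose overloads are more precise than the base."""

    @overload
    def any(self, level: None = ...) -> bool: ...
    @overload
    def any(self, level: str) -> Series: ...
    def any(self, level=None):
        return super().any(level)


s = Series({"a": [False, True], "b": [False]})
scalar = s.any()                  # checker sees bool
grouped = s.any(level="group")    # checker sees Series
```

In a stub file only the two overloaded signatures would appear on `Series`,
which is exactly what does not match a source tree where `any()` is
declared once on `NDFrame`.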
>
> Another example relates to `DataFrame.__getitem__()`, which is not
> possible to statically type: if you pass a string and the underlying
> DataFrame has duplicate column names corresponding to that string, you get
> a DataFrame as a result, but if the column is uniquely named within the
> DataFrame, you get a Series.  Asking users to always use `cast` to convert
> the result of `df["abc"]` would make the typing stubs unfriendly and not
> very useful.
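
The `__getitem__` behavior is easy to reproduce.  A short demonstration,
assuming pandas is installed (the column names and values are made up for
the example):

```python
import pandas as pd

# One frame with a uniquely named "abc" column, one with "abc" duplicated.
unique = pd.DataFrame([[1, 2]], columns=["abc", "xyz"])
duplicated = pd.DataFrame([[1, 2]], columns=["abc", "abc"])

# The same expression yields different runtime types.
from_unique = unique["abc"]          # a Series
from_duplicated = duplicated["abc"]  # a DataFrame with both "abc" columns

print(type(from_unique).__name__)
print(type(from_duplicated).__name__)
```

A stub that annotates `df["abc"]` as returning a Series is therefore
stricter than the runtime behavior, but it matches the common case without
forcing a `cast` on every column access.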
>
> So how do we move forward?  To be honest, I'm not sure, which is why we
> should discuss this.  Some ideas that I have are:
> A) Let's not manage the public facing stubs as part of the pandas project,
> and have a separate pandas-stubs project that we manage, using the MS stubs
> as a starting point.  These represent the "public" API, are separately
> type-checked from the source code, and can evolve separately from the
> regular development code.  They can also represent the most common ways
> that people use the pandas API, essentially defining a statically typed API
> representing the most common use cases.  If people want to use mypy or
> pyright or any other type checker, then they just install that package and
> get typing support.
> B) Move all type declarations out of the "py" files into "pyi" files.  I
> think this is what NumPy did (e.g., see numpy/core/numeric.py and
> numpy/core/numeric.pyi).  The advantage here is that we then don't have to
> worry about typing issues in the Python code, just the PYI files, and that
> could serve as a new basis for stubs for users.  But that doesn't solve the
> issue of things like `NDFrame.any()` described above.  There could be an
> advantage to having all type declarations appear only in PYI files anyway,
> in terms of our code maintenance.
> C) Create a "new" public API that lives in `pandas.api.typing`, and if you
> want to use typing, you do `import pandas.api.typing as pd`, then use
> `pd.Series` and `pd.DataFrame`, etc., which act as a set of wrappers
> around the current implementation.  So if you want to have type checking,
> you use the same code as you do today, but just change what is imported as
> "pd" to point to the typed API.
>
> There may be some other alternatives.  There may also be some way to
> migrate the MS stubs over, but I don't really have that much time to figure
> that out.
>
> Fundamentally, pandas uses a lot of dynamic typing under the hood to make
> it work.  We then have been incrementally adding type declarations, making
> them as precise as possible (not too narrow, not too wide), to support
> development of the source code.  But I think that to support users of
> pandas, we need to come up with a statically typed API, and just punt on
> the cases that correspond to unusual usage.  I like the numpy strategy
> where they write:
>
> NumPy is very flexible. Trying to describe the full range of possibilities
> statically would result in types that are not very helpful. For that
> reason, the typed NumPy API is often stricter than the runtime NumPy API.
>
> I think we need to keep this philosophy in mind as we make a decision as
> to what's right for pandas.
>
>
> @Dr-Irv
>
>
>