[Pandas-dev] Challenges in creating public pandas typing stubs

Fri Apr 8 13:34:06 EDT 2022

Thanks Irv for the update and all the work done here. Yes, let's discuss
this at the next dev meeting to help keep the momentum going and help
resolve the issues in migrating these stubs into pandas.

I won't respond to the technical issues here or go into too much detail
in this thread, save for the following points:

> A) Let's not manage the public facing stubs as part of the pandas project,
> and have a separate pandas-stubs project that we manage, using the MS stubs
> as a starting point.
>

Originally it was decided that this would be a maintenance burden and may
lead to inconsistencies. I think it is fine to revisit this in light of a
couple of years of lessons learnt, and also because there is now a public
API typing test framework with which we may be able to reduce (or
eliminate) the inconsistencies, if the same tests are run on the pandas
code and the pandas stubs.

> Move all type declarations out of the "py" files into "pyi" files.  I think
> this is what numpy did...
>

We basically do this now for our Cython code. We have pyi files that we
maintain manually, and we don't enforce PEP 484 style annotations in the
Cython code. Admittedly this was because we needed types for the
lower-level library functions in order to make progress on typing the
Python codebase. There are alternatives here that were not mature at the
time (and that we may need to revisit), such as generating stubs from
Cython code or type checking Cython code (say, using the pure Python mode
of Cython).

But for our Python code, our mix of pure Python to compiled code is
different from NumPy's, so I'm not sure that comparing to the NumPy project
is appropriate.

> We then have been incrementally adding type declarations, making them as
> precise as possible (not too narrow, not too wide), to support development
> of the source code.
>

I think that for the pandas public API, we have already been matching the
docstrings as much as possible and being fairly strict on what types are
accepted, but some docstrings use the terms list-like, array-like, and
dict-like, which by definition allow a wider range of types to be accepted.
Because many of the existing in-line type annotations were added when we
needed to support older versions of Python (this is not a restriction for
stubs), it could well be that many of these annotations need to be reviewed.

> For that reason, the typed NumPy API is often stricter than the runtime
> NumPy API.
>

Again, we already do this where we can, i.e. we omit deprecated behaviour
in overloads. We could maybe extend this to other function parameters but I
don't think we can do this for return types (see next point).

> and just punt on the cases that correspond to unusual usage.

The perceived usefulness of types differs between users. For
instance, a library developer who is using typing to make their code
more robust does need to know all the possible return types to be able to
code for these cases and prevent bugs in their code (e.g. if they could get
a NaT returned from a datetime constructor, they need to know this).
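
To make the return-type point concrete, here is a minimal sketch that uses
only the standard library. `NaTType`, `NaT`, `make_timestamp` and `year_of`
are hypothetical stand-ins invented for this illustration, not the pandas
API. The only claim is that a union return type makes the caller handle the
missing-value case.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional, Union


class NaTType:
    """Hypothetical 'not a time' sentinel (a stand-in, not pandas.NaT)."""

    def __repr__(self) -> str:
        return "NaT"


NaT = NaTType()


def make_timestamp(value: Optional[str]) -> Union[datetime, NaTType]:
    """Hypothetical constructor: returns NaT when there is nothing to parse."""
    if value is None:
        return NaT
    return datetime.fromisoformat(value)


def year_of(value: Optional[str]) -> Optional[int]:
    """Caller that must narrow the union before using datetime attributes."""
    ts = make_timestamp(value)
    if isinstance(ts, NaTType):
        # If the signature promised only `datetime`, a type checker could
        # not point out that this branch is needed.
        return None
    return ts.year
```

If the stub narrowed the return type to `datetime` alone, `year_of` would
type-check without the `isinstance` branch and fail at runtime instead.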

> Those stubs are now in pylance 2022.4.0 that was released yesterday,

Great!

@simonjayhawkins

On Thu, 7 Apr 2022 at 22:28, Irv Lustig <irv at princeton.com> wrote:

> All:
>
> Apologies in advance for the long email.  I think we should have a
> discussion on this topic at the next pandas dev meeting on April 13 at 2PM
> Eastern time.
>
> So there is good news and bad news.  The good news is the following:
>
> We're now at a point where the Microsoft typing stubs at
> https://github.com/microsoft/python-type-stubs/tree/main/partial/pandas
> and the tests that came from the pandas-stubs project at
> https://github.com/VirtusLab/pandas-stubs/tree/master/tests/snippets have
> been carried over and modified to be used in the Microsoft project at
> https://github.com/microsoft/python-type-stubs/tree/main/tests/pandas,
> with CI set up to test those stubs using pyright, mypy, and pytest.
>
> Those stubs are now in pylance 2022.4.0 that was released yesterday, and
> I've been using some code from our projects at my company to help determine
> where things were missing in those stubs, adding to the stubs and creating
> appropriate tests to get to the current version.  I'm sure they are not
> complete with respect to all the pandas methods, but we are covering a lot
> of typical use cases, in my opinion.
>
> So now the bad news....
>
> The problem that I'm facing is how to migrate the work done there over to
> the pandas project.  I thought this would be easy to do in some incremental
> fashion, but I've been unable to figure out a way to do that.  The issues
> are as follows:
> 1) Any types in the PYI files have to match what is in the source code.
> For the MS stubs, this is sometimes not the case.  (See below for an
> example)
> 2) mypy will first look in the PYI files for typing, but when typing
> doesn't exist, it will look in the source code.  There are places where the
> type declarations in the PYI files exist for classes and methods that are
> not typed in the source code.  That creates a huge number of mypy failures
> because of this inconsistency.
> 3) The MS stubs make the Series class generic.  Users don't have to use
> that, but it creates some nice features where you can figure out that
> `Series[Timestamp].__sub__(Timestamp) -> Series[Timedelta]`.  We could
> decide to remove that, although I have found it to be useful in my
> company's projects.
> 4) pandas/_typing.py in the source code and pandas/_typing.pyi from the
> stubs have some differences, since they evolved differently over time.
> They could probably be made consistent, but they are used in a different
> way for "internal" typing checks and "public" typing checks.
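
A rough sketch of the idea in item 3, using only the standard library. The
`Series` class here is a toy stand-in, not the class from the MS stubs, and
`datetime`/`timedelta` stand in for `Timestamp`/`Timedelta`. It assumes the
pattern of overloads keyed on the type of `self`, which mypy and pyright
both support; whether the MS stubs are written exactly this way is not
claimed.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Generic, List, TypeVar, overload

T = TypeVar("T")


class Series(Generic[T]):
    """Toy generic container used only to illustrate the typing pattern."""

    def __init__(self, values: List[T]) -> None:
        self.values = list(values)

    # Overloads keyed on the type of `self`: the element type of the result
    # is chosen from the element type of the calling Series.
    @overload
    def __sub__(self: Series[datetime], other: datetime) -> Series[timedelta]: ...
    @overload
    def __sub__(self: Series[float], other: float) -> Series[float]: ...

    def __sub__(self, other):  # single runtime implementation
        return Series([v - other for v in self.values])


stamps: Series[datetime] = Series([datetime(2022, 4, 8), datetime(2022, 4, 13)])
# A type checker should infer Series[timedelta] for `deltas`.
deltas = stamps - datetime(2022, 4, 7)
```

Code that never writes `Series[...]` explicitly is unaffected at runtime,
which matches the remark that users don't have to use the generic form.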
>
> As an example of the type matching, consider the method `DataFrame.any()`
> and `Series.any()`.  For this method, based on the parameter `level`, we
> know that it will respectively return a `DataFrame` or a `Series` if the
> calling class is a DataFrame, and will return a Series or a scalar if the
> calling class is a Series.  In the code, `DataFrame.any()` and
> `Series.any()` share the same declaration and implementation in
> `generic.py` via `NDFrame.any()`.  To accomplish the proper return typing
> for users in the MS stubs, we placed overloads for `any()` in frame.pyi and
> series.pyi.  That's a mismatch to the implementation.  There are probably
> a lot more examples like this.
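
The `any()` case might look roughly like this on the `Series` side. This is
a toy sketch with a stand-in class and a placeholder reduction, not the
actual contents of series.pyi. The `DataFrame` side would be analogous, with
`Series` and `DataFrame` as the two return types.

```python
from __future__ import annotations

import builtins
from typing import List, Optional, Union, overload


class Series:
    """Toy stand-in, not pandas.Series."""

    def __init__(self, values: List[bool]) -> None:
        self.values = list(values)

    # Public-facing overloads: the return type depends on `level`.
    @overload
    def any(self, level: None = ...) -> bool: ...
    @overload
    def any(self, level: int) -> Series: ...

    def any(self, level: Optional[int] = None) -> Union[bool, Series]:
        if level is None:
            # No level given: reduce to a scalar.
            return builtins.any(self.values)
        # Placeholder for a per-level reduction: a Series comes back.
        return Series([builtins.any(self.values)])
```

With overloads like these a checker can report `bool` for `s.any()` and
`Series` for `s.any(level=0)`, which a single shared signature on a common
base class cannot express.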
>
> Another example relates to `DataFrame.__getitem__()` which is not possible
> to statically type because if you pass a string, and the underlying
> DataFrame has duplicate column names corresponding to that string, you get
> a DataFrame as a result, but if the column is uniquely named within the
> DataFrame, you get a Series.  Asking users to always use `cast` to convert
> the result of `df["abc"]` would make the typing stubs non-friendly and not
> very useful.
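
A toy model of why `__getitem__` resists static typing. The classes below
are simplified stand-ins written for this illustration, not pandas itself.
The result type depends on the column labels, which are runtime data that
no annotation can see.

```python
from __future__ import annotations

from typing import Sequence, Union


class Series:
    """Toy single-column result."""

    def __init__(self, name: str, values: Sequence[int]) -> None:
        self.name = name
        self.values = list(values)


class DataFrame:
    """Toy frame whose column labels are allowed to repeat."""

    def __init__(self, columns: Sequence[str], data: Sequence[Sequence[int]]) -> None:
        # data[i] holds the values of the i-th column.
        self.columns = list(columns)
        self.data = [list(col) for col in data]

    def __getitem__(self, key: str) -> Union[Series, DataFrame]:
        matches = [i for i, name in enumerate(self.columns) if name == key]
        if not matches:
            raise KeyError(key)
        if len(matches) == 1:
            # Unique label: a single column comes back.
            return Series(key, self.data[matches[0]])
        # Duplicate labels: every matching column comes back as a frame.
        return DataFrame([key] * len(matches), [self.data[i] for i in matches])


unique = DataFrame(["abc", "x"], [[1, 2], [3, 4]])
duplicated = DataFrame(["abc", "abc"], [[1, 2], [3, 4]])
```

Here `unique["abc"]` is a `Series` and `duplicated["abc"]` is a `DataFrame`,
even though both calls look identical to a type checker. An honest
annotation is the union, and the union is what pushes users toward `cast`.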
>
> So how do we move forward?  To be honest, I'm not sure, which is why we
> should discuss this.  Some ideas that I have are:
> A) Let's not manage the public facing stubs as part of the pandas project,
> and have a separate pandas-stubs project that we manage, using the MS stubs
> as a starting point.  These represent the "public" API, are separately
> type-checked from the source code, and can evolve separately from the
> regular development code.  They can also represent the most common ways
> that people use the pandas API, essentially defining a statically typed API
> representing the most common use cases.  If people want to use mypy or
> pyright or any other type checker, then they just install that package and
> get typing support.
> B) Move all type declarations out of the "py" files into "pyi" files.  I
> think this is what numpy did (e.g., see numpy/core/numeric.py and
> numpy/core/numeric.pyi).  Advantage here is that we then don't have to
> worry about typing issues in the python code - just the PYI files, and that
> could serve as a new basis for stubs for users.  But that doesn't solve the
> issue of things like `NDFrame.any()` described above.  There could be an
> advantage to having all type declarations only appear in PYI files, anyway
> in terms of our code maintenance.
> C) Create a "new" public API that lives in `pandas.api.typing`, and if you
> want to use typing, you do `import pandas.api.typing as pd`, then use
> `pd.Series` and `pd.DataFrame`, etc., which acts as a set of wrappers
> around the current implementation.  So if you want to have type checking,
> you use the same code as you do today, but just change what is imported as
> "pd" to point to the typed API.
>
> There may be some other alternatives.  There may also be some way to
> migrate the MS stubs over, but I don't really have that much time to figure
> that out.
>
> Fundamentally, pandas uses a lot of dynamic typing under the hood to make
> it work.  We then have been incrementally adding type declarations, making
> them as precise as possible (not too narrow, not too wide), to support
> development of the source code.  But I think that to support users of
> pandas, we need to come up with a statically typed API, and just punt on
> the cases that correspond to unusual usage.  I like the numpy strategy
> where they write:
>
> NumPy is very flexible. Trying to describe the full range of possibilities
> statically would result in types that are not very helpful. For that
> reason, the typed NumPy API is often stricter than the runtime NumPy API.
>
> I think we need to keep this philosophy in mind as we make a decision as
> to what's right for pandas.
>
>
> @Dr-Irv