From simonjayhawkins at gmail.com  Mon Apr 4 11:21:13 2022
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Mon, 4 Apr 2022 16:21:13 +0100
Subject: [Pandas-dev] ANN: pandas v1.4.2
Message-ID:

Hi all,

I'm pleased to announce the release of pandas v1.4.2.

This is a patch release in the 1.4.x series and includes some regression
fixes and bug fixes. We recommend that all users upgrade to this version.

See the release notes for a list of all the changes.

The release can be installed from PyPI

python -m pip install --upgrade pandas==1.4.2

Or from conda-forge

conda install -c conda-forge pandas==1.4.2

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From raulpatricio88 at gmail.com  Tue Apr 5 15:36:45 2022
From: raulpatricio88 at gmail.com (Raul Escobar)
Date: Tue, 5 Apr 2022 15:36:45 -0400
Subject: [Pandas-dev] Problem while deleting blank spaces of a column of a DataFrame
Message-ID:

The .replace() method doesn't work when replacing multiple blank spaces in
a column while I'm creating an .xlsx file from a DataFrame in Pandas. I
also tried .str.strip(); it deletes the blank spaces, but it also deletes
all the cells of the column. I also used regex=True in the .replace()
method, but it still doesn't work. Here's the code I'm using:

import pandas as pd

from openpyxl import Workbook

book = Workbook()

operacional_1100 = book.active

maestro = pd.read_excel("2021 Gastos Ortodontik.xlsx", sheet_name="MAESTRO TR")

df_ordenar = maestro.iloc[:, [0,1,2,3,4]]

df_ordenar2 = df_ordenar['Monto'].replace(' ', '')

escrito = pd.ExcelWriter('prueba.xlsx')

df_ordenar2.to_excel(escrito)

escrito.save()

Thank you

--
Raul Escobar Mella
Freelance Programmer
Programmer Analyst student at Duoc Uc Maipú
Network Management at Duoc Uc
Santiago, Chile.
ID: 16.611.316-4
raulpatricio88 at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From garcia.marc at gmail.com  Tue Apr 5 16:58:52 2022
From: garcia.marc at gmail.com (Marc Garcia)
Date: Tue, 5 Apr 2022 21:58:52 +0100
Subject: [Pandas-dev] Problem while deleting blank spaces of a column of a DataFrame
In-Reply-To:
References:
Message-ID:

Thanks Raul. This list is for discussing the development of pandas itself.
I'd suggest opening a question on Stack Overflow with your problem. And if
the problem happens to be in pandas, not in your code, then you can create
an issue in the pandas GitHub repository.

On Tue, 5 Apr 2022, 21:14 Raul Escobar, wrote:

> The .replace() method doesn't work when replacing multiple blank spaces
> in a column while I'm creating an .xlsx file from a DataFrame in Pandas.
> I also tried .str.strip(); it deletes the blank spaces, but it also
> deletes all the cells of the column. I also used regex=True in the
> .replace() method, but it still doesn't work. Here's the code I'm using:
>
> import pandas as pd
>
> from openpyxl import Workbook
>
> book = Workbook()
>
> operacional_1100 = book.active
>
> maestro = pd.read_excel("2021 Gastos Ortodontik.xlsx", sheet_name="MAESTRO TR")
>
> df_ordenar = maestro.iloc[:, [0,1,2,3,4]]
>
> df_ordenar2 = df_ordenar['Monto'].replace(' ', '')
>
> escrito = pd.ExcelWriter('prueba.xlsx')
>
> df_ordenar2.to_excel(escrito)
>
> escrito.save()
>
> Thank you
>
> --
> Raul Escobar Mella
> Freelance Programmer
> Programmer Analyst student at Duoc Uc Maipú
> Network Management at Duoc Uc
> Santiago, Chile.
> ID: 16.611.316-4
> raulpatricio88 at gmail.com
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From reto at labrat.space  Tue Apr 5 16:54:55 2022
From: reto at labrat.space (Reto)
Date: Tue, 5 Apr 2022 22:54:55 +0200
Subject: [Pandas-dev] Problem while deleting blank spaces of a column of a DataFrame
In-Reply-To:
References:
Message-ID: <20220405205455.wsvwdtvtfwxpfqtu@feather>

On Tue, Apr 05, 2022 at 03:36:45PM -0400, Raul Escobar wrote:
> The .replace() method doesn't work when replacing multiple blank spaces
> in a column while I'm creating an .xlsx file from a DataFrame in Pandas.
> I also tried .str.strip(); it deletes the blank spaces, but it also
> deletes all the cells of the column. I also used regex=True in the
> .replace() method, but it still doesn't work. Here's the code I'm using:
>
> import pandas as pd
>
> from openpyxl import Workbook
>
> book = Workbook()
>
> operacional_1100 = book.active
>
> maestro = pd.read_excel("2021 Gastos Ortodontik.xlsx", sheet_name="MAESTRO TR")
>
> df_ordenar = maestro.iloc[:, [0,1,2,3,4]]

This selects multiple columns.

> df_ordenar2 = df_ordenar['Monto'].replace(' ', '')

This only selects the "Monto" column and discards the rest.
df_ordenar2 is a pd.Series with the Monto column having replaced spaces
with nothing.

> escrito = pd.ExcelWriter('prueba.xlsx')
> df_ordenar2.to_excel(escrito)
> escrito.save()

This is not needed, simply call to_excel with the filename.

In [15]: import pandas as pd

In [16]: ser = pd.Series(['f.o ', 'fuz', ' blah '])

In [17]: replaced = ser.str.replace(' ', '')

In [18]: replaced
Out[18]:
0     f.o
1     fuz
2    blah
dtype: object

In [20]: ser
Out[20]:
0      f.o 
1       fuz
2     blah 
dtype: object

From irv at princeton.com  Thu Apr 7 17:28:12 2022
From: irv at princeton.com (Irv Lustig)
Date: Thu, 7 Apr 2022 17:28:12 -0400
Subject: [Pandas-dev] Challenges in creating public pandas typing stubs
Message-ID:

All:

Apologies in advance for the long email. I think we should have a
discussion on this topic at the next pandas dev meeting on April 13 at 2PM
Eastern time.

So there is good news and bad news. The good news is the following:

We're now at a point where the Microsoft typing stubs at
https://github.com/microsoft/python-type-stubs/tree/main/partial/pandas
and the tests that came from the pandas-stubs project at
https://github.com/VirtusLab/pandas-stubs/tree/master/tests/snippets have
been carried over and modified to be used in the Microsoft project at
https://github.com/microsoft/python-type-stubs/tree/main/tests/pandas,
with CI set up to test those stubs using pyright, mypy, and pytest.
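To give a flavour of what those tests look like, here is a minimal sketch
in the same spirit (the test name and data are invented for illustration
and are not taken from either repository): the same file is type-checked
by pyright/mypy against the stubs and also executed by pytest.

import pandas as pd


def test_types_groupby_sum() -> None:
    # Checked statically against the stubs, and run by pytest to confirm
    # that the runtime behaviour agrees with the annotations.
    df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
    res = df.groupby("key").sum()
    assert isinstance(res, pd.DataFrame)
    assert res.loc["a", "val"] == 3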
Those stubs are now in pylance 2022.4.0 that was released yesterday, and I've been using some code from our projects at my company to help determine where things were missing in those stubs, adding to the stubs and creating appropriate tests to get to the current version. I'm sure they are not complete with respect to all the pandas methods, but we are covering a lot of typical use cases, in my opinion. So now the bad news.... The problem that I'm facing is how to migrate the work done there over to the pandas project. I thought this would be easy to do in some incremental fashion, but I've been unable to figure out a way to do that. The issues are as follows: 1) Any types in the PYI files have to match what is in the source code. For the MS stubs, this is sometimes not the case. (See below for an example) 2) mypy will first look in the PYI files for typing, but when typing doesn't exist, it will look in the source code. There are places where the type declarations in the PYI files exist for classes and methods that are not typed in the source code. That creates a huge number of mypy failures because of this inconsistency. 3) The MS stubs make the Series class generic. Users don't have to use that, but it creates some nice features where you can figure out that `Series[Timestamp].__sub__(Timestamp) -> Series[Timedelta]` . We could decide to remove that, although I have found it to be useful in my company's projects. 4) pandas/_typing.py in the source code and pandas/_typing.pyi from the stubs have some differences, since they evolved differently over time. They could probably be made consistent, but they are used in a different way for "internal" typing checks and "public" typing checks. As an example of the type matching, consider the method `DataFrame.any()` and `Series.any()`. For this method, based on the parameter `level`, we know that it will respectively return a `DataFrame` or a `Series` if the calling class is a DataFrame, and will return a Series or a scalar if the calling class is a Series. In the code, `DataFrame.any()` and `Series.any()` share the same declaration and implementation in `generic.py` via `NDFrame.any()`. To accomplish the proper return typing for users in the MS stubs, we placed overloads for `any()` in frame.pyi and series.pyi . That's a mismatch to the implementation. There are probably a lot more examples like this. Another example relates to `DataFrame.__getitem__()` which is not possible to statically type because if you pass a string, and the underlying DataFrame has duplicate column names corresponding to that string, you get a DataFrame as a result, but if the column is uniquely named within the DataFrame, you get a Series. Asking users to always use `cast` to convert the result of `df["abc"]` would make the typing stubs non-friendly and not very useful. So how do we move forward? To be honest, I'm not sure, which is why we should discuss this. Some ideas that I have are: A) Let's not manage the public facing stubs as part of the pandas project, and have a separate pandas-stubs project that we manage, using the MS stubs as a starting point. These represent the "public" API, are separately type-checked from the source code, and can evolve separately from the regular development code. They can also represent the most common ways that people use the pandas API, essentially defining a statically typed API representing the most common use cases. 
If people want to use mypy or pyright or any other type checker, then they
just install that package and get typing support.

B) Move all type declarations out of the "py" files into "pyi" files. I
think this is what numpy did (e.g., see numpy/core/numeric.py and
numpy/core/numeric.pyi). Advantage here is that we then don't have to
worry about typing issues in the python code - just the PYI files, and that
could serve as a new basis for stubs for users. But that doesn't solve the
issue of things like `NDFrame.any()` described above. There could be an
advantage to having all type declarations only appear in PYI files, anyway,
in terms of our code maintenance.

C) Create a "new" public API that lives in `pandas.api.typing`, and if you
want to use typing, you do `import pandas.api.typing as pd`, then use
`pd.Series` and `pd.DataFrame`, etc., which acts as a set of wrappers
around the current implementation. So if you want to have type checking,
you use the same code as you do today, but just change what is imported as
"pd" to point to the typed API.

There may be some other alternatives. There may also be some way to
migrate the MS stubs over, but I don't really have that much time to figure
that out.

Fundamentally, pandas uses a lot of dynamic typing under the hood to make
it work. We then have been incrementally adding type declarations, making
them as precise as possible (not too narrow, not too wide), to support
development of the source code. But I think that to support users of
pandas, we need to come up with a statically typed API, and just punt on
the cases that correspond to unusual usage. I like the numpy strategy
where they write:

NumPy is very flexible. Trying to describe the full range of possibilities
statically would result in types that are not very helpful. For that
reason, the typed NumPy API is often stricter than the runtime NumPy API.

I think we need to keep this philosophy in mind as we make a decision as
to what's right for pandas.

@Dr-Irv
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From simonjayhawkins at gmail.com  Fri Apr 8 13:34:06 2022
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Fri, 8 Apr 2022 18:34:06 +0100
Subject: [Pandas-dev] Challenges in creating public pandas typing stubs
In-Reply-To:
References:
Message-ID:

Thanks Irv for the update and all the work done here. Yes, let's discuss
this at the next dev meeting to help keep the momentum going and help
resolve the issues in migrating these stubs into pandas.

I won't respond to the technical issues here or discuss in too much detail
in this thread, save the following points:

> A) Let's not manage the public facing stubs as part of the pandas project,
> and have a separate pandas-stubs project that we manage, using the MS stubs
> as a starting point.

Originally it was decided that this would be a maintenance burden and may
lead to inconsistencies. I think it is fine to revisit this in light of a
couple of years of lessons learnt, and also that there is now a public API
typing testing framework, so we may be able to reduce (or eliminate) the
inconsistencies if the same tests are run on the pandas code and the
pandas stubs.

> Move all type declarations out of the "py" files into "pyi" files. I think
> this is what numpy did...

We basically now do this for our cython code. We have pyi files that we
manually maintain. We don't enforce using PEP 484 style annotations in the
Cython code. Admittedly this was because of the need for the types for the
lower level library functions to make progress on typing the Python
codebase, and there are alternatives here that were not at the time mature
(again, may need to revisit), such as generating stubs from Cython code or
type checking Cython code (say, using the pure Python mode of Cython).

But for our Python code, our mix of pure Python to compiled code is
different to NumPy, so I'm not sure that comparing to the NumPy project is
appropriate.

> We then have been incrementally adding type declarations, making them as
> precise as possible (not too narrow, not too wide), to support development
> of the source code.

I think that for the pandas public API, we have already been matching the
docstrings as much as possible and being fairly strict on what types are
accepted, but some docstrings use the terms list-like, array-like and
dict-like, which by definition allow a wider range of types to be accepted.
Because many of the existing in-line type annotations were added when we
needed to support older versions of Python (this is not a restriction for
stubs), it could well be that many of these annotations do need to be
reviewed.

> For that reason, the typed NumPy API is often stricter than the runtime
> NumPy API.

Again, we already do this where we can, i.e. we omit deprecated behaviour
in overloads. We could maybe extend this to other function parameters, but
I don't think we can do this for return types (see next point).

> and just punt on the cases that correspond to unusual usage.

The perception of usefulness of types is different for different users.
For instance, a library developer who is using typing to make their code
more robust does need to know all the possible return types to be able to
code for these cases and prevent bugs in their code (e.g. if they could get
a NaT returned from a datetime constructor, they need to know this).

> Those stubs are now in pylance 2022.4.0 that was released yesterday,

great!

@simonjayhawkins

On Thu, 7 Apr 2022 at 22:28, Irv Lustig wrote:

> All:
>
> Apologies in advance for the long email. I think we should have a
> discussion on this topic at the next pandas dev meeting on April 13 at 2PM
> Eastern time.
>
> So there is good news and bad news. The good news is the following:
>
> We're now at a point where the Microsoft typing stubs at
> https://github.com/microsoft/python-type-stubs/tree/main/partial/pandas
> and the tests that came from the pandas-stubs project at
> https://github.com/VirtusLab/pandas-stubs/tree/master/tests/snippets have
> been carried over and modified to be used in the Microsoft project at
> https://github.com/microsoft/python-type-stubs/tree/main/tests/pandas,
> with CI set up to test those stubs using pyright, mypy, and pytest.
>
> Those stubs are now in pylance 2022.4.0 that was released yesterday, and
> I've been using some code from our projects at my company to help determine
> where things were missing in those stubs, adding to the stubs and creating
> appropriate tests to get to the current version. I'm sure they are not
> complete with respect to all the pandas methods, but we are covering a lot
> of typical use cases, in my opinion.
>
> So now the bad news....
>
> The problem that I'm facing is how to migrate the work done there over to
> the pandas project. I thought this would be easy to do in some incremental
> fashion, but I've been unable to figure out a way to do that.
The issues > are as follows: > 1) Any types in the PYI files have to match what is in the source code. > For the MS stubs, this is sometimes not the case. (See below for an > example) > 2) mypy will first look in the PYI files for typing, but when typing > doesn't exist, it will look in the source code. There are places where the > type declarations in the PYI files exist for classes and methods that are > not typed in the source code. That creates a huge number of mypy failures > because of this inconsistency. > 3) The MS stubs make the Series class generic. Users don't have to use > that, but it creates some nice features where you can figure out that > `Series[Timestamp].__sub__(Timestamp) -> Series[Timedelta]` . We could > decide to remove that, although I have found it to be useful in my > company's projects. > 4) pandas/_typing.py in the source code and pandas/_typing.pyi from the > stubs have some differences, since they evolved differently over time. > They could probably be made consistent, but they are used in a different > way for "internal" typing checks and "public" typing checks. > > As an example of the type matching, consider the method `DataFrame.any()` > and `Series.any()`. For this method, based on the parameter `level`, we > know that it will respectively return a `DataFrame` or a `Series` if the > calling class is a DataFrame, and will return a Series or a scalar if the > calling class is a Series. In the code, `DataFrame.any()` and > `Series.any()` share the same declaration and implementation in > `generic.py` via `NDFrame.any()`. To accomplish the proper return typing > for users in the MS stubs, we placed overloads for `any()` in frame.pyi and > series.pyi . That's a mismatch to the implementation. There are probably > a lot more examples like this. > > Another example relates to `DataFrame.__getitem__()` which is not possible > to statically type because if you pass a string, and the underlying > DataFrame has duplicate column names corresponding to that string, you get > a DataFrame as a result, but if the column is uniquely named within the > DataFrame, you get a Series. Asking users to always use `cast` to convert > the result of `df["abc"]` would make the typing stubs non-friendly and not > very useful. > > So how do we move forward? To be honest, I'm not sure, which is why we > should discuss this. Some ideas that I have are: > A) Let's not manage the public facing stubs as part of the pandas project, > and have a separate pandas-stubs project that we manage, using the MS stubs > as a starting point. These represent the "public" API, are separately > type-checked from the source code, and can evolve separately from the > regular development code. They can also represent the most common ways > that people use the pandas API, essentially defining a statically typed API > representing the most common use cases. If people want to use mypy or > pyright or any other type checker, then they just install that package and > get typing support. > B) Move all type declarations out of the "py" files into "pyi" files. I > think this is what numpy did (e.g., see numpy/core/numeric.py and > numpy/core/numeric.pyi). Advantage here is that we then don't have to > worry about typing issues in the python code - just the PYI files, and that > could serve as a new basis for stubs for users. But that doesn't solve the > issue of things like `NDFrame.any()` described above. 
There could be an > advantage to having all type declarations only appear in PYI files, anyway > in terms of our code maintenance. > C) Create a "new" public API that lives in `pandas.api.typing`, and if you > want to use typing, you do `import pandas.api.typing as pd` , then use > `pd.Series` and `pd.DataFrame`, etc., which acts as a set of wrappers > around the current implementation. So if you want to have type checking, > you use the same code as you do today, but just change what is imported as > "pd" to point to the typed API. > > There may be some other alternatives. There may also be some way to > migrate the MS stubs over, but I don't really have that much time to figure > that out. > > Fundamentally, pandas uses a lot of dynamic typing under the hood to make > it work. We then have been incrementally adding type declarations, making > them as precise as possible (not too narrow, not too wide), to support > development of the source code. But I think that to support users of > pandas, we need to come up with a statically typed API, and just punt on > the cases that correspond to unusual usage. I like the numpy strategy > where they write: > > NumPy is very flexible. Trying to describe the full range of possibilities > statically would result in types that are not very helpful. For that > reason, the typed NumPy API is often stricter than the runtime NumPy API. > > I think we need to keep this philosophy in mind as we make a decision as > to what's right for pandas. > > > @Dr-Irv > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From irv at princeton.com Fri Apr 8 14:54:58 2022 From: irv at princeton.com (Irv Lustig) Date: Fri, 8 Apr 2022 14:54:58 -0400 Subject: [Pandas-dev] Challenges in creating public pandas typing stubs In-Reply-To: References: Message-ID: Simon: Thanks for your response. I wrote: > and just punt on the cases that correspond to unusual usage. > > You responded: The perception of usefulness of types is different for different users. For > instance, a library developer who is using typing to make their code > more robust does need to know all the possible return types to be able code > for these cases and prevent bugs in their code. (e.g. if they could get a > NAT returned from a datetime constructor, they need to know this) When I said "unusual", I meant for end users. I agree with you that the needs of pandas developers and the needs of end users are different. I think you're more interested in the former, and I'm more interested in the latter. So hopefully we can figure out how the pandas project can provide something useful to the end user community as opposed to depending on third parties to do so, which is why I quoted that line from the numpy docs as a potential philosophy to follow. With respect to the datetime constructor, there are two ways to look at it. Right now we have an overload that looks like this (with most arguments omitted): @overload def to_datetime(arg: Union[int, float, str, datetime.datetime]) -> Timestamp: ... If a user were to have the expression `pd.to_datetime(np.nan)`, the result is `NaT`, which is not a valid `Timestamp` . But suppose we change this to: @overload def to_datetime(arg: Union[int, float, str, datetime.datetime]) -> Timestamp | NaT: ... 
Then, as a user, you might have to write code that looks like this as a user: mytime = cast(pd.Timestamp, pd.to_datetime("2022-04-08")) That's rather inconvenient for the most common use case. So by making the signature more "strict", we're helping what I believe to be the majority of users out. Sure, a user might have code that passed in `np.nan` to `to_datetime()` and we wouldn't catch that from a typing perspective, but on the other hand, the value `np.nan` comes in as a parameter in a mostly dynamic context. IMHO, there is only so much we can do in terms of being helpful with the typing signatures. -Irv On Fri, Apr 8, 2022 at 1:34 PM Simon Hawkins wrote: > Thanks Irv for the update and all the work done here. Yes, let's discuss > this at the next dev meeting to help keep the momentum going and help > resolve the issues in migrating these stubs into pandas. > > I won't respond to the technical issues here or discuss in to much detail > in this thread save the following points > > A) Let's not manage the public facing stubs as part of the pandas project, >> and have a separate pandas-stubs project that we manage, using the MS stubs >> as a starting point. >> > > Originally it was decided that this would be a maintenance burden and may > lead to inconsistencies. I think it is fine to revisit this in light of a > couple of years of lessons learnt and also that there is now also a public > api typing testing framework that we may be able to reduce (eliminate) the > inconsistencies if the same tests are run on the pandas codes and the > pandas stubs. > > Move all type declarations out of the "py" files into "pyi" files. I >> think this is what numpy did... >> > > We basically now do this for our cython code. We have pyi files that we > manually maintain. We don't enforce using PEP 484 style annotations in the > Cython code. Admittedly this was because of the need for the types for the > lower level library functions to make progress on typing the Python > codebase and there are alternatives here that were not at the time mature > (again may need to revisit) such as generating stubs from Cython code or > type checking Cython code (say using the pure Python mode of Cython) > > But for our Python code, our mix of pure python to compiled code is > different to Numpy so I'm not sure that comparing to the Numpy project is > appropropriate. > > We then have been incrementally adding type declarations, making them as >> precise as possible (not too narrow, not too wide), to support development >> of the source code. >> > > I think that for the pandas public api, we have already been matching the > docstrings as much as possible and being fairly strict on what types are > accepted but some docstrings use the terms list-like, array-like, dict-like > which by definition allow a wider range of types to be accepted. Because > many of the existing in-line type annotations were added when we needed to > support older versions of Python (this is not a restriction for stubs) then > it could well be that many of these annotations do need to be reviewed. > > For that reason, the typed NumPy API is often stricter than the runtime >> NumPy API. >> > > Again, we already do this where we can, i.e. we omit deprecated behaviour > in overloads. We could maybe extend this to other function parameters but I > don't think we can do this for return types (see next point). > > and just punt on the cases that correspond to unusual usage. > > > The perception of usefulness of types is different for different users. 
> For instance, a library developer who is using typing to make their code > more robust does need to know all the possible return types to be able code > for these cases and prevent bugs in their code. (e.g. if they could get a > NAT returned from a datetime constructor, they need to know this) > > Those stubs are now in pylance 2022.4.0 that was released yesterday, > > > great! > > > @simonjayhawkins > > On Thu, 7 Apr 2022 at 22:28, Irv Lustig wrote: > >> All: >> >> Apologies in advance for the long email. I think we should have a >> discussion on this topic at the next pandas dev meeting on April 13 at 2PM >> Eastern time. >> >> So there is good news and bad news. The good news is the following: >> >> We're now at a point where the Microsoft typing stubs at >> https://github.com/microsoft/python-type-stubs/tree/main/partial/pandas >> and the tests that came from the pandas-stubs project at >> https://github.com/VirtusLab/pandas-stubs/tree/master/tests/snippets >> have been carried over and modified to be used in the Microsoft project at >> https://github.com/microsoft/python-type-stubs/tree/main/tests/pandas, >> with CI set up to test those stubs using pyright, mypy, and pytest. >> >> Those stubs are now in pylance 2022.4.0 that was released yesterday, and >> I've been using some code from our projects at my company to help determine >> where things were missing in those stubs, adding to the stubs and creating >> appropriate tests to get to the current version. I'm sure they are not >> complete with respect to all the pandas methods, but we are covering a lot >> of typical use cases, in my opinion. >> >> So now the bad news.... >> >> The problem that I'm facing is how to migrate the work done there over to >> the pandas project. I thought this would be easy to do in some incremental >> fashion, but I've been unable to figure out a way to do that. The issues >> are as follows: >> 1) Any types in the PYI files have to match what is in the source code. >> For the MS stubs, this is sometimes not the case. (See below for an >> example) >> 2) mypy will first look in the PYI files for typing, but when typing >> doesn't exist, it will look in the source code. There are places where the >> type declarations in the PYI files exist for classes and methods that are >> not typed in the source code. That creates a huge number of mypy failures >> because of this inconsistency. >> 3) The MS stubs make the Series class generic. Users don't have to use >> that, but it creates some nice features where you can figure out that >> `Series[Timestamp].__sub__(Timestamp) -> Series[Timedelta]` . We could >> decide to remove that, although I have found it to be useful in my >> company's projects. >> 4) pandas/_typing.py in the source code and pandas/_typing.pyi from the >> stubs have some differences, since they evolved differently over time. >> They could probably be made consistent, but they are used in a different >> way for "internal" typing checks and "public" typing checks. >> >> As an example of the type matching, consider the method `DataFrame.any()` >> and `Series.any()`. For this method, based on the parameter `level`, we >> know that it will respectively return a `DataFrame` or a `Series` if the >> calling class is a DataFrame, and will return a Series or a scalar if the >> calling class is a Series. In the code, `DataFrame.any()` and >> `Series.any()` share the same declaration and implementation in >> `generic.py` via `NDFrame.any()`. 
To accomplish the proper return typing >> for users in the MS stubs, we placed overloads for `any()` in frame.pyi and >> series.pyi . That's a mismatch to the implementation. There are probably >> a lot more examples like this. >> >> Another example relates to `DataFrame.__getitem__()` which is not >> possible to statically type because if you pass a string, and the >> underlying DataFrame has duplicate column names corresponding to that >> string, you get a DataFrame as a result, but if the column is uniquely >> named within the DataFrame, you get a Series. Asking users to always use >> `cast` to convert the result of `df["abc"]` would make the typing stubs >> non-friendly and not very useful. >> >> So how do we move forward? To be honest, I'm not sure, which is why we >> should discuss this. Some ideas that I have are: >> A) Let's not manage the public facing stubs as part of the pandas >> project, and have a separate pandas-stubs project that we manage, using the >> MS stubs as a starting point. These represent the "public" API, are >> separately type-checked from the source code, and can evolve separately >> from the regular development code. They can also represent the most common >> ways that people use the pandas API, essentially defining a statically >> typed API representing the most common use cases. If people want to use >> mypy or pyright or any other type checker, then they just install that >> package and get typing support. >> B) Move all type declarations out of the "py" files into "pyi" files. I >> think this is what numpy did (e.g., see numpy/core/numeric.py and >> numpy/core/numeric.pyi). Advantage here is that we then don't have to >> worry about typing issues in the python code - just the PYI files, and that >> could serve as a new basis for stubs for users. But that doesn't solve the >> issue of things like `NDFrame.any()` described above. There could be an >> advantage to having all type declarations only appear in PYI files, anyway >> in terms of our code maintenance. >> C) Create a "new" public API that lives in `pandas.api.typing`, and if >> you want to use typing, you do `import pandas.api.typing as pd` , then use >> `pd.Series` and `pd.DataFrame`, etc., which acts as a set of wrappers >> around the current implementation. So if you want to have type checking, >> you use the same code as you do today, but just change what is imported as >> "pd" to point to the typed API. >> >> There may be some other alternatives. There may also be some way to >> migrate the MS stubs over, but I don't really have that much time to figure >> that out. >> >> Fundamentally, pandas uses a lot of dynamic typing under the hood to make >> it work. We then have been incrementally adding type declarations, making >> them as precise as possible (not too narrow, not too wide), to support >> development of the source code. But I think that to support users of >> pandas, we need to come up with a statically typed API, and just punt on >> the cases that correspond to unusual usage. I like the numpy strategy >> where they write: >> >> NumPy is very flexible. Trying to describe the full range of >> possibilities statically would result in types that are not very helpful. >> For that reason, the typed NumPy API is often stricter than the runtime >> NumPy API. >> >> I think we need to keep this philosophy in mind as we make a decision as >> to what's right for pandas. 
>>
>> @Dr-Irv
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From maartenb at xs4all.nl  Sat Apr 9 12:58:31 2022
From: maartenb at xs4all.nl (Maarten Ballintijn)
Date: Sat, 9 Apr 2022 12:58:31 -0400
Subject: [Pandas-dev] Challenges in creating public pandas typing stubs
In-Reply-To:
References:
Message-ID: <98CE08FB-B979-41D5-9692-614A72BFBB55@xs4all.nl>

Part of the problem appears to be the fact that NaT is its own type. The
difficulty with the typing results from that in ways which do not occur
with NaNs. (And the cast is not really a solution either.)

I think it would be preferable to have two conversion APIs.

- One checked (throwing on error) for cases that "should never fail".

- And one with an inline error value (NaT, like NaN) where the user is
  responsible for dealing with the error at some point.

Proper typing becomes much easier in that case.

> On Apr 8, 2022, at 2:54 PM, Irv Lustig wrote:
>
> With respect to the datetime constructor, there are two ways to look at
> it. Right now we have an overload that looks like this (with most
> arguments omitted):
>
> @overload
> def to_datetime(arg: Union[int, float, str, datetime.datetime]) -> Timestamp: ...
>
> If a user were to have the expression `pd.to_datetime(np.nan)`, the
> result is `NaT`, which is not a valid `Timestamp`. But suppose we change
> this to:
>
> @overload
> def to_datetime(arg: Union[int, float, str, datetime.datetime]) -> Timestamp | NaT: ...
>
> Then, as a user, you might have to write code that looks like this as a user:
> mytime = cast(pd.Timestamp, pd.to_datetime("2022-04-08"))
>
> That's rather inconvenient for the most common use case. So by making the
> signature more "strict", we're helping what I believe to be the majority
> of users out. Sure, a user might have code that passed in `np.nan` to
> `to_datetime()` and we wouldn't catch that from a typing perspective, but
> on the other hand, the value `np.nan` comes in as a parameter in a mostly
> dynamic context. IMHO, there is only so much we can do in terms of being
> helpful with the typing signatures.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From irv at princeton.com  Mon Apr 11 09:42:03 2022
From: irv at princeton.com (Irv Lustig)
Date: Mon, 11 Apr 2022 09:42:03 -0400
Subject: [Pandas-dev] Challenges in creating public pandas typing stubs
In-Reply-To: <98CE08FB-B979-41D5-9692-614A72BFBB55@xs4all.nl>
References: <98CE08FB-B979-41D5-9692-614A72BFBB55@xs4all.nl>
Message-ID:

Maarten:

In terms of the two conversion APIs, we have that to some extent (at least
with respect to `pd.to_datetime()`) with the `errors` parameter:

errors : {'ignore', 'raise', 'coerce'}, default 'raise'

- If 'raise', then invalid parsing will raise an exception.
- If 'coerce', then invalid parsing will be set as NaT.
- If 'ignore', then invalid parsing will return the input.

The issue here is that "parsing" `np.NaN` does not raise.

-Irv

On Sat, Apr 9, 2022 at 12:58 PM Maarten Ballintijn wrote:

> Part of the problem appears to be the fact that NaT is its own type.
> The difficulty with the typing results from that in ways which do not
> occur with NaNs.
> (And the cast is not really a solution either.)
>
> I think it would be preferable to have two conversion APIs.
>
> - One checked (throwing on error) for cases that "should never fail".
> > > > - And one with an inline error value (NaT, like NaN) where the user is > responsible for dealing with the error at some point. > > > Proper typing becomes much easier in that case. > > > > On Apr 8, 2022, at 2:54 PM, Irv Lustig wrote: > > With respect to the datetime constructor, there are two ways to look at > it. Right now we have an overload that looks like this (with most > arguments omitted): > > @overload > def to_datetime(arg: Union[int, float, str, datetime.datetime]) -> > Timestamp: ... > > If a user were to have the expression `pd.to_datetime(np.nan)`, the > result is `NaT`, which is not a valid `Timestamp` . But suppose we change > this to: > > @overload > def to_datetime(arg: Union[int, float, str, datetime.datetime]) -> > Timestamp | NaT: ... > > Then, as a user, you might have to write code that looks like this as a > user: > mytime = cast(pd.Timestamp, pd.to_datetime("2022-04-08")) > > That's rather inconvenient for the most common use case. So by making the > signature more "strict", we're helping what I believe to be the majority of > users out. Sure, a user might have code that passed in `np.nan` to > `to_datetime()` and we wouldn't catch that from a typing perspective, but > on the other hand, the value `np.nan` comes in as a parameter in a mostly > dynamic context. IMHO, there is only so much we can do in terms of being > helpful with the typing signatures. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From irv at princeton.com Wed Apr 13 17:42:50 2022 From: irv at princeton.com (Irv Lustig) Date: Wed, 13 Apr 2022 17:42:50 -0400 Subject: [Pandas-dev] Challenges in creating public pandas typing stubs In-Reply-To: References: Message-ID: We discussed the issue at today's pandas development meeting. Simon Hawkins, Brock Mendel, Richard Shadrach, and I agreed that option A (having a separate pandas-stubs project) would be the best way forward. We asked to get Jeff Reback's approval, and I asked him privately, and he agrees with that approach. Here's my proposal for a way forward: 1. Work with the Microsoft guys to take what they have and move it over to a new pandas-stubs repo that we'll create, and have them no longer maintain their repo, and have them modify their processes to pull our repo into theirs for pylance releases until we get our repo published on pypi . 2. Work with the VirtusLabs team to see how we get our pandas-stubs package to be on pypi to replace theirs. 3. Maybe someone else can help with getting pandas-stubs on conda-forge and/or the main anaconda channel once we get it on pypi. -Irv On Thu, Apr 7, 2022 at 5:28 PM Irv Lustig wrote: > All: > > Apologies in advance for the long email. I think we should have a > discussion on this topic at the next pandas dev meeting on April 13 at 2PM > Eastern time. > > So there is good news and bad news. The good news is the following: > > We're now at a point where the Microsoft typing stubs at > https://github.com/microsoft/python-type-stubs/tree/main/partial/pandas > and the tests that came from the pandas-stubs project at > https://github.com/VirtusLab/pandas-stubs/tree/master/tests/snippets have > been carried over and modified to be used in the Microsoft project at > https://github.com/microsoft/python-type-stubs/tree/main/tests/pandas, > with CI set up to test those stubs using pyright, mypy, and pytest. 
> > Those stubs are now in pylance 2022.4.0 that was released yesterday, and > I've been using some code from our projects at my company to help determine > where things were missing in those stubs, adding to the stubs and creating > appropriate tests to get to the current version. I'm sure they are not > complete with respect to all the pandas methods, but we are covering a lot > of typical use cases, in my opinion. > > So now the bad news.... > > The problem that I'm facing is how to migrate the work done there over to > the pandas project. I thought this would be easy to do in some incremental > fashion, but I've been unable to figure out a way to do that. The issues > are as follows: > 1) Any types in the PYI files have to match what is in the source code. > For the MS stubs, this is sometimes not the case. (See below for an > example) > 2) mypy will first look in the PYI files for typing, but when typing > doesn't exist, it will look in the source code. There are places where the > type declarations in the PYI files exist for classes and methods that are > not typed in the source code. That creates a huge number of mypy failures > because of this inconsistency. > 3) The MS stubs make the Series class generic. Users don't have to use > that, but it creates some nice features where you can figure out that > `Series[Timestamp].__sub__(Timestamp) -> Series[Timedelta]` . We could > decide to remove that, although I have found it to be useful in my > company's projects. > 4) pandas/_typing.py in the source code and pandas/_typing.pyi from the > stubs have some differences, since they evolved differently over time. > They could probably be made consistent, but they are used in a different > way for "internal" typing checks and "public" typing checks. > > As an example of the type matching, consider the method `DataFrame.any()` > and `Series.any()`. For this method, based on the parameter `level`, we > know that it will respectively return a `DataFrame` or a `Series` if the > calling class is a DataFrame, and will return a Series or a scalar if the > calling class is a Series. In the code, `DataFrame.any()` and > `Series.any()` share the same declaration and implementation in > `generic.py` via `NDFrame.any()`. To accomplish the proper return typing > for users in the MS stubs, we placed overloads for `any()` in frame.pyi and > series.pyi . That's a mismatch to the implementation. There are probably > a lot more examples like this. > > Another example relates to `DataFrame.__getitem__()` which is not possible > to statically type because if you pass a string, and the underlying > DataFrame has duplicate column names corresponding to that string, you get > a DataFrame as a result, but if the column is uniquely named within the > DataFrame, you get a Series. Asking users to always use `cast` to convert > the result of `df["abc"]` would make the typing stubs non-friendly and not > very useful. > > So how do we move forward? To be honest, I'm not sure, which is why we > should discuss this. Some ideas that I have are: > A) Let's not manage the public facing stubs as part of the pandas project, > and have a separate pandas-stubs project that we manage, using the MS stubs > as a starting point. These represent the "public" API, are separately > type-checked from the source code, and can evolve separately from the > regular development code. 
They can also represent the most common ways > that people use the pandas API, essentially defining a statically typed API > representing the most common use cases. If people want to use mypy or > pyright or any other type checker, then they just install that package and > get typing support. > B) Move all type declarations out of the "py" files into "pyi" files. I > think this is what numpy did (e.g., see numpy/core/numeric.py and > numpy/core/numeric.pyi). Advantage here is that we then don't have to > worry about typing issues in the python code - just the PYI files, and that > could serve as a new basis for stubs for users. But that doesn't solve the > issue of things like `NDFrame.any()` described above. There could be an > advantage to having all type declarations only appear in PYI files, anyway > in terms of our code maintenance. > C) Create a "new" public API that lives in `pandas.api.typing`, and if you > want to use typing, you do `import pandas.api.typing as pd` , then use > `pd.Series` and `pd.DataFrame`, etc., which acts as a set of wrappers > around the current implementation. So if you want to have type checking, > you use the same code as you do today, but just change what is imported as > "pd" to point to the typed API. > > There may be some other alternatives. There may also be some way to > migrate the MS stubs over, but I don't really have that much time to figure > that out. > > Fundamentally, pandas uses a lot of dynamic typing under the hood to make > it work. We then have been incrementally adding type declarations, making > them as precise as possible (not too narrow, not too wide), to support > development of the source code. But I think that to support users of > pandas, we need to come up with a statically typed API, and just punt on > the cases that correspond to unusual usage. I like the numpy strategy > where they write: > > NumPy is very flexible. Trying to describe the full range of possibilities > statically would result in types that are not very helpful. For that > reason, the typed NumPy API is often stricter than the runtime NumPy API. > > I think we need to keep this philosophy in mind as we make a decision as > to what's right for pandas. > > > @Dr-Irv > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
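As a final illustration of the to_datetime() behaviour discussed in this
thread, the short snippet below sketches the points made above. It is
written against pandas 1.4.x; the exact exception type caught is an
assumption and may differ between versions.

import numpy as np
import pandas as pd

# Missing input is treated as missing rather than invalid: no exception is
# raised even with the default errors="raise".
assert pd.to_datetime(np.nan) is pd.NaT

# A genuinely unparseable string does raise under the default settings...
try:
    pd.to_datetime("not a date")
except (ValueError, TypeError):
    pass

# ...but is coerced to NaT when errors="coerce" is passed.
assert pd.to_datetime("not a date", errors="coerce") is pd.NaT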