[Pandas-dev] Plans of future PDEPs

Marc Garcia garcia.marc at gmail.com
Thu Aug 25 06:18:44 EDT 2022


I'm planning to work on a few PDEPs, and I think it's probably worth sharing
the general ideas in advance, in case anyone has early feedback, wants to
coordinate, or simply finds it useful to know before the draft PDEPs are
ready. I'm not intending to open the discussion on their details here; that
probably makes more sense in each draft PDEP once it's ready.

*Governance*: While implementing PDEPs, we discussed defining exactly how a
PDEP is approved, and it's probably useful to clarify the project's
decision making in general. I also want to propose the creation of some
workgroups within pandas to manage things more efficiently, defining their
roles, how members are selected... Besides the existing CoC and NumFOCUS
(finances) workgroups, I think it could be useful to have workgroups for
infrastructure (managing access to our servers, the GitHub org, the
distribution lists...) and for communications (tweeting, blogging,
coordinating with NumFOCUS and other projects...). Maybe also a release
workgroup and a steering committee, if people find them useful...

*Release*: I think it would be useful to have a PDEP detailing the release
process. It would be similar to our policies doc
<https://pandas.pydata.org/docs/development/policies.html>, but more
detailed and explicit about exactly when releases happen and what the
criteria are. This wouldn't necessarily change the current release policy,
but I think we can better manage the expectations of both users and
developers, and be more efficient in the discussions, if there is less
ambiguity about when releases are going to happen.

*IO modules as plugins*: The idea would be to create a framework to extend
pandas with IO functions, and move some of the current IO adaptors (e.g.
stata, spss, feather...) to third-party projects. I think this would
encourage people to build more pandas IO packages (for new formats, or
improved versions of the existing ones), and it would make these modules
better maintained (easier to become a maintainer, maintainers who are
experts in the formats...). It would also reduce the complexity of pandas
itself and of the CI, significantly reduce the number of optional
dependencies, and offer users a more natural way to install things (install
pandas-sas, as opposed to having read_sas fail and then having to install
its dependencies).
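
As a rough illustration of the kind of framework this could be, here is a
minimal sketch of a reader registry. Everything in it is an assumption made
for illustration: `register_reader`, `read`, and the registry itself are
hypothetical names, not an existing pandas API, and a dict stands in for a
real DataFrame.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a format name to a reader function.
# pandas has no such API today; this only sketches the idea.
_READERS: Dict[str, Callable[..., object]] = {}


def register_reader(fmt: str):
    """Decorator a third-party IO package could use to expose its reader."""
    def decorator(func):
        if fmt in _READERS:
            raise ValueError(f"a reader for {fmt!r} is already registered")
        _READERS[fmt] = func
        return func
    return decorator


def read(fmt: str, path: str, **kwargs):
    """Dispatch to the registered reader, or hint at a package to install."""
    try:
        reader = _READERS[fmt]
    except KeyError:
        raise ImportError(
            f"no reader registered for {fmt!r}; "
            f"a plugin such as pandas-{fmt} might provide one"
        ) from None
    return reader(path, **kwargs)


# What a hypothetical pandas-sas package might do when imported.
@register_reader("sas")
def _read_sas(path, **kwargs):
    # Stand-in for real parsing that would return a DataFrame.
    return {"format": "sas", "path": path}
```

With something along these lines, a missing format would fail with a message
naming a package to install, rather than with a missing optional dependency
deep inside a reader.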

*Duplicate columns*: pandas currently supports them, but it's not clear if
there are many (or any) use cases where this is helpful, and it clearly
makes pandas more complex, both in its internals and in its usability (e.g.
almost all users would expect `df['my_col']` to return a Series, but that's
not necessarily true, since there can be several columns named "my_col", in
which case it returns a DataFrame). I think it's useful to create a PDEP
with the advantages and actual use cases of having duplicate columns, and
their cost in terms of problems and extra complexity, and then make a
decision on whether we want to continue supporting them.
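
To make the usability point concrete, here is a small example of the
behaviour described above. The column names and values are made up for
illustration.

```python
import pandas as pd

# Two columns share the label "my_col"; "other" is unique.
df = pd.DataFrame([[1, 2, 3]], columns=["my_col", "my_col", "other"])

unique = df["other"]       # a unique label selects a Series
duplicated = df["my_col"]  # a duplicated label selects a DataFrame

print(type(unique).__name__)
print(type(duplicated).__name__)
print(duplicated.shape)
```

Code that assumes `df['my_col']` is always a Series would break on the second
selection, which is the kind of cost the PDEP would need to weigh.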

Cheers,