[Pandas-dev] Plans of future PDEPs

Jeff Reback jeffreback at gmail.com
Thu Aug 25 10:27:02 EDT 2022


Duplicate columns are a fact of life, and we already have a mechanism for handling them.
I am not sure you will be able to reduce much here at all; clarifying the mechanisms is probably OK.
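
For reference, a minimal sketch of how duplicate column labels behave today and the tools pandas offers to detect or reject them. The column names and data are made up for illustration, and the `allows_duplicate_labels` flag is only one existing mechanism, not necessarily the one meant above.

```python
import pandas as pd

# Illustrative frame with a duplicated column label (names are made up).
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["my_col", "my_col", "other"],
)

unique_sel = df["other"]   # unique label: selection returns a Series
dup_sel = df["my_col"]     # duplicated label: selection returns a DataFrame

# Detecting duplicates on the column index.
has_dups = not df.columns.is_unique
dup_mask = df.columns.duplicated()  # marks repeats after the first occurrence

# Keeping only the first occurrence of each label.
deduped = df.loc[:, ~dup_mask]

# Opting out of duplicate labels entirely (flag available since pandas 1.2).
try:
    df.set_flags(allows_duplicate_labels=False)
    rejected = False
except pd.errors.DuplicateLabelError:
    rejected = True
```

Code that needs `df[name]` to always be a Series can check `df.columns.is_unique` up front or set the flag, rather than relying on duplicates never appearing.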

We have discussed an IO plugin mechanism previously, and I am supportive of an API to allow external development.
We could and should have the current IO adapters use this, but I don’t believe we can achieve many of the benefits that you list; in fact, CI complexity is likely to go up.
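
To make the plugin idea concrete, here is a rough sketch of what a reader registry could look like. Everything in it is hypothetical: `register_reader`, `read`, and the "pipe" format are invented for illustration, and no such API exists in pandas. A real design would more likely discover plugins through package entry points than through an in-process dict.

```python
import io
from typing import Callable, Dict

import pandas as pd

# Hypothetical registry mapping a format name to a reader function.
_READERS: Dict[str, Callable[..., pd.DataFrame]] = {}


def register_reader(fmt: str):
    """Decorator a plugin would use to register a reader (hypothetical)."""
    def decorator(func: Callable[..., pd.DataFrame]):
        if fmt in _READERS:
            raise ValueError(f"a reader for {fmt!r} is already registered")
        _READERS[fmt] = func
        return func
    return decorator


def read(fmt: str, source, **kwargs) -> pd.DataFrame:
    """Dispatch to the registered reader, with a clear error if none exists."""
    try:
        reader = _READERS[fmt]
    except KeyError:
        raise ValueError(
            f"no reader registered for {fmt!r}; a plugin may need to be installed"
        ) from None
    return reader(source, **kwargs)


# What a third-party package might do on import (format name is made up).
@register_reader("pipe")
def read_pipe(source, **kwargs) -> pd.DataFrame:
    return pd.read_csv(source, sep="|", **kwargs)


df = read("pipe", io.StringIO("a|b\n1|2\n3|4\n"))
```

If the current adapters were routed through something like this, the built-in and external readers would share one code path, which is the part that seems worth having regardless of whether any adapters move out of the main repository.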

In any event, the PDEPs are certainly the place to discuss this.

> On Aug 25, 2022, at 6:19 AM, Marc Garcia <garcia.marc at gmail.com> wrote:
> 
> 
> I'm planning to work on a few PDEPs, and I think it's probably worth sharing the general ideas in advance, in case anyone has early feedback, wants to coordinate, or simply finds it useful to know before the draft PDEPs are ready. I don't intend to open the discussion on their details here; it probably makes more sense to discuss those in the draft PDEPs when they are ready.
> 
> Governance: We discussed defining exactly how a PDEP is approved while implementing them, and it's probably useful to clarify the project's decision making in general. I also want to propose the creation of some workgroups within pandas to manage things more efficiently, defining their roles, how members are selected... Besides the existing CoC and NumFOCUS (finances) workgroups, I think it could be useful to have workgroups for infrastructure (managing access to our servers, the GitHub org, the distribution lists...) and for communications (tweeting, blogging, coordinating with NumFOCUS and other projects...). Maybe also a release workgroup and a steering committee, if people find them useful...
> 
> Release: I think it would be useful to have a PDEP detailing the release process. It would be similar to our policies doc, but more detailed and explicit about exactly when releases happen and what the criteria are. The goal is not necessarily to change the current release policy, but I think we can better manage the expectations of both users and developers, and be more efficient in discussions, if there is less ambiguity about when releases are going to happen.
> 
> IO modules as plugins: The idea would be to create a framework to extend pandas with IO functions, and move some of the current IO adaptors (e.g. stata, spss, feather...) to third-party projects. I think this would encourage people to build more pandas IO packages (for new formats, or improved versions of the existing ones), make these modules better maintained (easier to become a maintainer, maintainers who are experts in the formats...), reduce the complexity of pandas itself and of the CI, significantly reduce the number of optional dependencies, and offer users a more natural way to install things (install pandas-sas, as opposed to having read_sas fail and then having to install its dependencies).
> 
> Duplicate columns: pandas currently supports them, but it's not clear if there are many (or any) use cases where this is helpful, and it clearly makes pandas more complex, both in its internals and in its usability (e.g. almost all users would expect `df['my_col']` to return a Series, but that is not necessarily true, since there can be several columns named "my_col", in which case it returns a DataFrame). I think it's useful to create a PDEP listing the advantages and actual use cases of duplicate columns, and their cost in terms of problems and extra complexity, and then make a decision on whether we want to continue supporting them.
> 
> Cheers,