From garcia.marc at gmail.com  Mon Mar  5 15:55:41 2018
From: garcia.marc at gmail.com (Marc Garcia)
Date: Mon, 5 Mar 2018 20:55:41 +0000
Subject: [Pandas-dev] Open questions regarding docstrings
Message-ID:

Hi there,

There are a few things regarding the docstrings that are still open to
discussion, in many cases because the numpy convention (or the numpydoc
examples) is different from the unwritten convention used in most pandas
docstrings.

You have probably seen the discussion on GitHub, but I list the points
here, with the proposed decision (mainly to keep the pandas way). If
anyone disagrees on any point, please let us know, so we can change the
documentation before the sprint and do it in the desired way.

1) Starting the docstring just after the opening triple quotes, or on
the next line. In pandas it's more common to do it on the next line, so
we'll keep it this way.

2) For parameters, showing the default value after the type, or after
the description. NumPy does not consider it necessary to specify
defaults at all and, when they are specified, recommends placing them
after the description. The proposal (mainly by Joris) is to always
include them, placed after the type, as that is easier to see.

3) For parameters expecting a string, the numpy convention examples use
`str`; the proposal is to use `string` instead.

4) For complex types like dicts, I think there is some consensus that it
is easier to understand the types when using brackets (e.g. "dict of
{str: int}" over "dict of str: int"). The same goes for tuples (e.g.
"tuple of (int, str, int)" over "tuple of int, str, int"). For lists and
sets, the type is simpler (e.g. "list of int" or "set of str"). I
propose to use the brackets for dict and tuple, not for list and set,
and to use `str` over `string` when it is part of a complex type.

5) For cases where a parameter is optional, that is, it has a default
value of None meaning the value is not required. (As I understand it, a
case like `fillna(value=None)` would not count as optional in this
sense, since there None would mean the value used to replace `NaN`.)
Here the proposal is to describe the type as something like "int or
float, optional" over "int, float or None (default None)".

6) When the parameter expects something in the form of a Python list, a
numpy array, a pandas Series... document it as "array-like" over other
options like "iterable" or "numpy.array, Series or list".
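To make this concrete, a Parameters section following all of the above
could look something like this (just a sketch I wrote for illustration,
not something taken from the codebase):

```python
def example(sep=',', max_rows=60, mapping=None, values=None):
    """
    Short summary, starting on the line after the quotes (point 1).

    Parameters
    ----------
    sep : string, default ','
        Field delimiter (default after the type, points 2 and 3).
    max_rows : int, default 60
        Maximum number of rows to show.
    mapping : dict of {str: int}, optional
        Brackets, and `str` inside a complex type (point 4).
    values : array-like, optional
        Optional parameter defaulting to None (points 5 and 6).
    """
```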
Thanks!

From cbartak at gmail.com  Mon Mar  5 17:07:28 2018
From: cbartak at gmail.com (Chris Bartak)
Date: Mon, 5 Mar 2018 16:07:28 -0600
Subject: [Pandas-dev] Open questions regarding docstrings
In-Reply-To:
References:
Message-ID:

Hi Marc,

Thanks for pulling out this list. The only one of these that seems
potentially objectionable to me is #3 - it does seem like we're pretty
inconsistent on this currently, but in my opinion it'd be better to side
with `str` - matching the actual python type, numpy, mypy annotations,
etc.

On Mon, Mar 5, 2018 at 2:55 PM, Marc Garcia wrote:
> [...]

From tom.augspurger88 at gmail.com  Mon Mar 5 17:58:21 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Mon, 5 Mar 2018 14:58:21 -0800
Subject: [Pandas-dev] Open questions regarding docstrings
In-Reply-To:
References:
Message-ID:

Agreed with Chris about 3. In the same vein, about 4 and 6, I could see
more precision in the docstrings as an aid to adopting function
annotations and mypy in the future. Is List[int] too ugly / unusual for
readers?

Case in point, one of your examples from 6, a Python list, isn't
array-like (in the sense that is_array_like(List) is False). Documenting
exactly what we mean by array-like is probably not something we're ready
for, but I'd like to hear what others think about adopting mypy's
spelling of types where it's not too burdensome.
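For instance (only a sketch of the spelling I mean, with made-up names):

```python
from typing import Dict, List


def example(counts: List[int], mapping: Dict[str, int]) -> int:
    """
    Parameters
    ----------
    counts : List[int]
        The docstring type spelled the same way as the annotation.
    mapping : Dict[str, int]
        Instead of "dict of {str: int}".
    """
    return sum(counts) + sum(mapping.values())
```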
Tom

On Mon, Mar 5, 2018 at 2:07 PM, Chris Bartak wrote:
> [...]

From jorisvandenbossche at gmail.com  Mon Mar 5 18:05:13 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 6 Mar 2018 00:05:13 +0100
Subject: [Pandas-dev] Open questions regarding docstrings
In-Reply-To:
References:
Message-ID:

Yes, thanks for the overview.

Regarding the type descriptions, as a reference, an overview of all
currently used type descriptions can be seen here:
https://github.com/pandas-dev/pandas/pull/19704#issuecomment-369405611

From that you can see that many things are now rather inconsistent (str
vs string, optional vs default None, ...; in most cases both variants
are about equally used). So we should make choices! :)

For str vs string: I *think* "string" can be more readable and
understandable for newcomers (not sure how well known the str type is
for this user group). But of course, if taking "string" rather than
"str", we should maybe also look at "int" vs "integer", "bool" vs
"boolean", etc. I can live with either decision.

Joris

2018-03-05 23:07 GMT+01:00 Chris Bartak:
> [...]
From ml at pietrobattiston.it  Thu Mar 15 13:36:38 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 15 Mar 2018 18:36:38 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
Message-ID: <1521135398.20162.23.camel@pietrobattiston.it>

Dear pandas devs,

like most (I think) of you, I love how pandas supports chained method
calls.

And like several other users, I get frustrated when I have to break some
chained sequence of calls because a given operation cannot be included.
See for instance
https://stackoverflow.com/q/11869910/2858145
https://stackoverflow.com/q/40028500/2858145
https://stackoverflow.com/q/44912692/2858145

I ended up noticing that most of the time, the problematic operation is
a filtering, since it is typically done as

df.loc[condition_on(df)]

e.g.

df.loc[df['a'] > 3]

In R, we would do (something more similar to)

df.loc[a>3]

... but we can't in Python syntax. This is not usually a huge deal - one
could even claim that "df[df['a'] > 3]" is nicer because it's more
explicit. Still, when it's not df but rather a five-line chain of calls,
one needs to create the df and then filter it, which is annoying.

There are a couple of other solutions: df.filter, adding an ad-hoc
method to pandas objects... but I never found any of them general and/or
pythonic enough. So I tried an alternative: lazy evaluation. It took
relatively few lines of code, and after some weeks of use I'm really
satisfied with the result:
https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py
(do not bother about the rest of the repo, the file works as a
standalone module).

This allows one to replace
df.loc[df['a'] > 2]
with
df.loc[W['a'] > 2]
... and to apply virtually any operation one would apply to df (more
precisely, any operation... which is chainable). (*)

As a bonus, one can write a condition once and reuse it to filter
several pandas objects.

I'm writing this email to ask:
- whether you have in mind some alternative solution I did not consider
to the problem of "unchainable filterings"
- whether you have suggestions on how to improve my solution
- whether you think this is worth merging into pandas (the amount of
monkey patching required is so small that it is not burdensome to keep
it separate - it just means one more dependency for users who want to
use it)

For the record: it currently works only in .loc... and I don't expect
this to change: I guess pd.{Series,DataFrame}.__getitem__ already
support too many different mechanisms.

Supporting .loc as a setter should instead be pretty straightforward -
it is just lower priority, as it is not used in chaining.

Pietro

(*) The only exception (that I know of) at the moment: W.loc(axis=1)[...]
won't work, because I "taught" it that "loc" is not a callable.
Shouldn't be hard to fix.

From tom.augspurger88 at gmail.com  Thu Mar 15 13:58:23 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 15 Mar 2018 12:58:23 -0500
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521135398.20162.23.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID:

FYI, `df.loc[lambda x: x['a'] > 3]` is valid. loc takes a callable, and
evaluates it with the NDFrame as the first (only) argument.

So the downside is now that `lambda x:` is a bit more to type than `W`,
but it's not so bad. And if you have a pre-defined method for filtering,
it's `df.loc[condition_on]`, which is the shortest (but maybe not
clearest) way of spelling that.
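A quick (made-up) example showing that it chains:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# the callable receives the intermediate frame, so no temporary name
# is needed in the middle of a chain:
result = (
    df.assign(b=lambda x: x["a"] * 2)
      .loc[lambda x: x["a"] > 3]
)
print(result)
```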
- Tom

On Thu, Mar 15, 2018 at 12:36 PM, Pietro Battiston wrote:
> [...]

From cbartak at gmail.com  Thu Mar 15 14:03:18 2018
From: cbartak at gmail.com (Chris Bartak)
Date: Thu, 15 Mar 2018 13:03:18 -0500
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521135398.20162.23.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID:

If you're not aware, we do have one (IMO ugly) solution for this using
lambdas:

function_making_df().loc[lambda x: x['a'] > 3]

There is also some prior art in pandas_ply [1] and dplython [2]. I had a
WIP PR adding a version of pandas_ply to pandas [3], but never finished
it out; there was some concern about the API expansion. I was ultimately
using a lambda as the delivery mechanism.
I am in favor of the general concept, though I wonder if there is a
better long-term solution around expansion of the python language for
some kind of light macro support, and/or a fully delayed expression
system, a la ibis.

[1] - https://github.com/coursera/pandas-ply
[2] - https://github.com/dodger487/dplython
[3] - https://github.com/pandas-dev/pandas/pull/14209

On Thu, Mar 15, 2018 at 12:36 PM, Pietro Battiston wrote:
> [...]
From ml at pietrobattiston.it  Thu Mar 15 14:24:12 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 15 Mar 2018 19:24:12 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID: <1521138252.20162.25.camel@pietrobattiston.it>

On Thu, Mar 15, 2018 at 12:58 -0500, Tom Augspurger wrote:
> FYI, `df.loc[lambda x: x['a'] > 3]` is valid. loc takes a callable,
> and evaluates it with the NDFrame as the first (only) argument.

Aha! I knew one could pass callables, but I had mistakenly assumed the
mechanism was analogous to df.apply(), i.e. accepting rows/elements
rather than the NDFrame itself.

I think I like my solution better... but for sure adding it to pandas
would duplicate functionality that is already present.

Thanks (to you and Chris) for the pointer,

Pietro

From ml at pietrobattiston.it  Thu Mar 15 14:39:08 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 15 Mar 2018 19:39:08 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID: <1521139148.20162.27.camel@pietrobattiston.it>

On Thu, Mar 15, 2018 at 13:03 -0500, Chris Bartak wrote:
> [...]
> There is also some prior art in pandas_ply [1] and dplython [2]. I had
> a WIP PR adding a version of pandas_ply to pandas [3], but never
> finished it out; there was some concern about the API expansion. I was
> ultimately using a lambda as the delivery mechanism.
>
> I am in favor of the general concept, though I wonder if there is a
> better long-term solution around expansion of the python language for
> some kind of light macro support, and/or a fully delayed expression
> system, a la ibis.
>
> [1] - https://github.com/coursera/pandas-ply
> [2] - https://github.com/dodger487/dplython
> [3] - https://github.com/pandas-dev/pandas/pull/14209

Funny... basically the same API, completely different implementation. I
guess I will steal the idea of making it callable (avoiding
monkey-patching, and supporting assign())... thanks!
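To show what I mean, here is a toy version of the mechanism (nothing
like the real module, just enough to illustrate the idea): if the
deferred object is itself callable, .loc accepts it directly.

```python
import pandas as pd


class Deferred:
    """Record operations on a placeholder, replay them on a real object."""

    def __init__(self, func=lambda obj: obj):
        self._func = func

    def __getitem__(self, key):
        return Deferred(lambda obj: self._func(obj)[key])

    def __gt__(self, other):
        return Deferred(lambda obj: self._func(obj) > other)

    def __call__(self, obj):
        return self._func(obj)


W = Deferred()

df = pd.DataFrame({"a": [1, 2, 3, 4]})
# works because .loc already accepts callables, as Tom pointed out:
print(df.loc[W["a"] > 2])
```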
Pietro

From justin.lewis at gmail.com  Thu Mar 15 15:10:02 2018
From: justin.lewis at gmail.com (Justin Lewis)
Date: Thu, 15 Mar 2018 15:10:02 -0400
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521139148.20162.27.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
Message-ID:

I might be missing the point but can you use .pipe()?

In [1]: df = pd.util.testing.makeTimeDataFrame()

In [2]: df
Out[2]:
                   A         B         C         D
2000-01-03 -0.870800  0.517496 -1.129341  1.074059
2000-01-04 -0.102295  1.811238 -2.080829 -1.145249
2000-01-05 -0.608380 -0.754805  1.196582  1.480967
2000-01-06  0.358763 -0.929273  0.190293  0.191154
2000-01-07  1.984208  0.579810 -0.369664  1.583910
...              ...       ...       ...       ...
2000-02-07  0.917228 -0.200213  0.893922 -0.960147
2000-02-08  0.490313  0.728865 -0.978162  1.028735
2000-02-09  1.415720 -0.855196  1.868628 -0.247138
2000-02-10  0.613818  0.488457 -1.042366 -1.831410
2000-02-11 -1.433825  0.062954 -0.856178 -0.273247

[30 rows x 4 columns]

In [3]: df.pipe(lambda x: x[x.A > .1])
Out[3]:
                   A         B         C         D
2000-01-06  0.358763 -0.929273  0.190293  0.191154
2000-01-07  1.984208  0.579810 -0.369664  1.583910
2000-01-10  0.872874 -1.378924  0.644806  0.988295
2000-01-11  0.252953 -0.181655  0.049428  0.545417
2000-01-13  0.602725 -0.221286 -0.208824 -0.913126
...              ...       ...       ...       ...
2000-02-04  0.319361 -0.664777 -0.460101  0.111564
2000-02-07  0.917228 -0.200213  0.893922 -0.960147
2000-02-08  0.490313  0.728865 -0.978162  1.028735
2000-02-09  1.415720 -0.855196  1.868628 -0.247138
2000-02-10  0.613818  0.488457 -1.042366 -1.831410

[17 rows x 4 columns]

In [4]: df.pipe(lambda x: x[x.A > .1]).pipe(lambda x: x[x.B > .1])
Out[4]:
                   A         B         C         D
2000-01-07  1.984208  0.579810 -0.369664  1.583910
2000-01-25  0.724618  2.134328  0.269921  1.633488
2000-01-26  1.011798  0.989021 -1.472997  0.849001
2000-02-02  0.300020  0.490800  1.786019  1.389062
2000-02-03  0.729878  0.341635 -0.972437 -0.670142
2000-02-08  0.490313  0.728865 -0.978162  1.028735
2000-02-10  0.613818  0.488457 -1.042366 -1.831410

In [5]:

On Thu, Mar 15, 2018 at 2:39 PM, Pietro Battiston wrote:
> [...]

From ml at pietrobattiston.it  Thu Mar 22 10:35:31 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 22 Mar 2018 15:35:31 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
Message-ID: <1521729331.25305.87.camel@pietrobattiston.it>

On Thu, Mar 15, 2018 at 15:10 -0400, Justin Lewis wrote:
> I might be missing the point but can you use .pipe()?

Indeed, this is something else I had not considered.

However, I don't like it too much. Compare

.loc[W]

with

.pipe(lambda df : df[df])

By the way,

.loc[lambda df : df[df]]

is equivalent but cleaner to me (after all, we are selecting).

This said, the solutions proposed by you and Chris are indeed more
robust than mine. For instance,

.loc[W + 1 > 2]

works, but

.loc[2 < 1 + W]

doesn't, and I don't even know if a fix is possible.

Pietro

From cpcloud at gmail.com  Thu Mar 22 11:24:11 2018
From: cpcloud at gmail.com (Phillip Cloud)
Date: Thu, 22 Mar 2018 15:24:11 +0000
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521729331.25305.87.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
 <1521729331.25305.87.camel@pietrobattiston.it>
Message-ID:

If you feel like being evil, you can use a so-called "frame hack"
+ a context manager:

In [1]: import sys
   ...:
   ...: import numpy as np
   ...: import pandas as pd
   ...:
   ...:
   ...: class ctx:
   ...:     def __init__(self, df):
   ...:         self.df = df
   ...:         current_frame = sys._getframe(0)
   ...:         self.locals = current_frame.f_back.f_locals
   ...:         self.existing_values = {
   ...:             k: self.locals[k] for k in df.columns
   ...:             if k in self.locals
   ...:         }
   ...:         self.new_values = {k for k in df.columns if k not in self.locals}
   ...:
   ...:     def __enter__(self):
   ...:         for k in self.df.columns:
   ...:             self.locals[k] = self.df[k]
   ...:         return
   ...:
   ...:     def __exit__(self, *exc):
   ...:         self.locals.update(self.existing_values)
   ...:         for k in self.new_values:
   ...:             del self.locals[k]
   ...:

In [2]: df = pd.DataFrame({'a': np.array([1, 2], dtype='float32')})

In [3]: try:
   ...:     a + 1
   ...: except NameError:
   ...:     print("'a' doesn't exist yet!")
   ...:
'a' doesn't exist yet!

In [4]: with ctx(df):
   ...:     print(df[a == 1])
   ...:
     a
0  1.0

In [5]: try:
   ...:     a + 1
   ...: except NameError:
   ...:     print("'a' doesn't exist yet!")
   ...:
'a' doesn't exist yet!

On Thu, Mar 22, 2018 at 10:35 AM Pietro Battiston wrote:
> [...]

From tom.augspurger88 at gmail.com  Thu Mar 22 11:27:52 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 22 Mar 2018 10:27:52 -0500
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
 <1521729331.25305.87.camel@pietrobattiston.it>
Message-ID:

Sounds like someone's been learning from David Beazley :)

Now just define DataFrame.__enter__ to pass `self` to `ctx`, and write
it as

```
with df:
    print(df[a == 1])
```
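Roughly like this, maybe (a completely untested sketch; it does the
frame hack inside __enter__ itself, and like Phillip's version it only
really works at the interactive top level, where f_locals is writable):

```python
import sys

import pandas as pd


class ContextFrame(pd.DataFrame):
    """A DataFrame whose columns become names in the caller's scope."""

    # allow our bookkeeping attributes on the subclass
    _metadata = ['_scope', '_saved', '_added']

    @property
    def _constructor(self):
        return ContextFrame

    def __enter__(self):
        self._scope = sys._getframe(1).f_locals  # the caller's namespace
        self._saved = {k: self._scope[k] for k in self.columns
                       if k in self._scope}
        self._added = [k for k in self.columns if k not in self._scope]
        for k in self.columns:
            self._scope[k] = self[k]
        return self

    def __exit__(self, *exc):
        self._scope.update(self._saved)
        for k in self._added:
            del self._scope[k]


df = ContextFrame({'a': [1.0, 2.0]})
with df:
    print(df[a == 1])  # 'a' is injected into this scope by __enter__
```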
On Thu, Mar 22, 2018 at 10:24 AM, Phillip Cloud wrote:
> [...]

From leportella at gmail.com  Fri Mar 23 09:21:42 2018
From: leportella at gmail.com (Leticia Portella)
Date: Fri, 23 Mar 2018 10:21:42 -0300
Subject: [Pandas-dev] Docs translation
Message-ID:

Hello!

I am wondering if there is any effort being made nowadays to translate
the pandas documentation to new languages (in my case, Brazilian
Portuguese).

If there is, can you tell me where this is being done, so I can help?

If not, can you help me start a project on Transifex or some other tool
to make this possible?

Thank you for your attention.

Kind regards,

Leticia

-- 
Letícia Portella
leportella.com
podcast.datascience.pizza

From tom.augspurger88 at gmail.com  Fri Mar 23 09:50:59 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 23 Mar 2018 08:50:59 -0500
Subject: [Pandas-dev] Docs translation
In-Reply-To:
References:
Message-ID:

IIRC, someone started on a Korean translation a while back.

Sphinx has some support for localization:
http://www.sphinx-doc.org/en/stable/intl.html, though I have no
experience with it.

I would recommend seeing if you can adjust pandas' sphinx configuration
to use that, translate a small file, and then make a PR with what you
found. Then there's the question of ensuring translations stay up to
date, but perhaps we can worry about that later.
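From a quick look at those docs, the setup would be roughly (untested,
so take it as a sketch; the paths just follow pandas' doc layout):

```python
# in doc/source/conf.py
locale_dirs = ['locale/']
gettext_compact = False
```

and then something like

```
sphinx-build -b gettext doc/source doc/build/gettext
sphinx-intl update -p doc/build/gettext -l pt_BR
```

to extract the messages and create the .po files to translate
(sphinx-intl is a separate package).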
On Fri, Mar 23, 2018 at 8:21 AM, Leticia Portella wrote:
> [...]

From jorisvandenbossche at gmail.com  Wed Mar 28 06:16:24 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 28 Mar 2018 12:16:24 +0200
Subject: [Pandas-dev] [pydata] Proposal to change the default number of
 rows for DataFrame display (lower max_rows)
In-Reply-To:
References:
Message-ID:

Coming back to this (we are again discussing a concrete PR proposing
this change: https://github.com/pandas-dev/pandas/pull/20514)

2017-12-08 16:11 GMT+01:00 Tom Augspurger:
> On Fri, Dec 8, 2017 at 8:54 AM, Joris Van den Bossche wrote:
>> [Note for those reading it on the pydata mailing list, please answer
>> to pandas-dev at python.org to keep discussion centralised there]
>>
>> Hi all,
>>
>> I am reposting the mail of Clemens below, but with slightly changed
>> focus, as I think the main discussion point is about the number of
>> rows.
>>
>> The proposal in https://github.com/pandas-dev/pandas/pull/17023 is to
>> lower the default number of rows shown when displaying a Series or
>> DataFrame from 60 to 20.
>> Thoughts on that?
>
> Personally, I always set the max rows to 10 or 20, so I'd be OK with
> it if the community is on board.

I also often set this to a lower value like that (e.g. typically for
tutorials), so I am also in favor of changing *something*.

However, my main 'problem' is that, in interactive usage, with a lower
default it becomes very cumbersome to actually look at more data
(changing the setting just to inspect some data). For example, if the
new max_rows default were 10, doing df.head(20) to quickly inspect some
more data would still only show 10 rows.

We cannot change what a function like head does (its result is still a
normal repr following the same options, since it needs to actually
return a dataframe, not only display it), but therefore I have another
proposal, sketched after this list:

- We have two thresholds instead of one (the current 'max_rows'): a
number of rows to show *in* a truncated repr, and a max number of rows
to show without truncating.
- For 'big' dataframes, we show a truncated repr. And there I would go
even lower than 20 and only show the first/last 5 (so like a max_rows
of 10).
- For 'small' dataframes, we show the full dataframe without
truncating, up to the threshold.

Of course, the difficulty is then to determine what we call 'big' and
'small', i.e. what is the threshold to show a truncated repr (and this
part will again get more subjective :)). But for example, using the
current max_rows of 60: we could show a full repr up to 60 rows, and
once the number of rows > 60, we only show 10 (first/last 5).

You can then still set both thresholds to the same number (like 20) to
not get this variable behaviour.

This is actually similar to what numpy arrays do (but with a bigger
threshold: e.g. np.random.randn(1000) shows all 1000 elements,
np.random.randn(1001) shows the first/last 3).

It's just an idea, but I think this might be a way to satisfy more use
cases at once (and more possibility to fine-tune the behaviour).
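In pseudo-code, the repr would then decide something like this (names
invented for the example):

```python
def rows_to_show(n_rows, print_min=10, print_max=60):
    # 'small' frame: show everything, no truncation
    if n_rows <= print_max:
        return n_rows
    # 'big' frame: truncated repr, first/last print_min // 2 rows
    return print_min

assert rows_to_show(60) == 60   # full repr
assert rows_to_show(61) == 10   # truncated to 10 once over the threshold
assert rows_to_show(61, print_min=20, print_max=20) == 20
```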
Joris

From jorisvandenbossche at gmail.com  Wed Mar 28 07:44:30 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 28 Mar 2018 13:44:30 +0200
Subject: [Pandas-dev] [pydata] Proposal to change the default number of
 rows for DataFrame display (lower max_rows)
In-Reply-To:
References:
Message-ID:

2018-03-28 12:16 GMT+02:00 Joris Van den Bossche:
> [...]
> This is actually similar to what numpy arrays do (but with a bigger
> threshold: e.g. np.random.randn(1000) shows all 1000 elements,
> np.random.randn(1001) shows the first/last 3).

And it seems this is also what R tibbles do: they have "print_min" and
"print_max" options with exactly this behaviour, only their defaults are
lower (print_min is 10 and print_max is 20):

> options(tibble.print_max = n, tibble.print_min = m): if there are more
> than n rows, print only the first m rows. Use
> options(tibble.print_max = Inf) to always show all rows.

(from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
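If we copied that, usage on our side could look something like this (the
second option name is purely hypothetical at this point):

```python
import pandas as pd

pd.set_option("display.max_rows", 60)  # existing option: full repr up to here
pd.set_option("display.min_rows", 10)  # hypothetical: rows shown once truncated
```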
Joris

From jorisvandenbossche at gmail.com  Thu Mar 29 09:29:49 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 29 Mar 2018 15:29:49 +0200
Subject: [Pandas-dev] Welcome Pietro to the core team
Message-ID:

Hi all,

On behalf of the core developers I'd like to welcome Pietro Battiston
(@toobaz) to the core dev team.

Pietro has been active on a regular basis for the last two years, and
has been especially closely involved in discussions regarding
MultiIndexing, where he did some nice work.

Thanks for all those contributions, and looking forward to further
collaboration and your continued involvement!

Joris

From wesmckinn at gmail.com  Thu Mar 29 10:53:27 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 29 Mar 2018 10:53:27 -0400
Subject: [Pandas-dev] Welcome Pietro to the core team
In-Reply-To:
References:
Message-ID:

Thanks Pietro, and welcome!

On Thu, Mar 29, 2018 at 9:29 AM, Joris Van den Bossche wrote:
> [...]

From gfyoung17 at gmail.com  Thu Mar 29 10:56:19 2018
From: gfyoung17 at gmail.com (G Young)
Date: Thu, 29 Mar 2018 10:56:19 -0400
Subject: [Pandas-dev] Welcome Pietro to the core team
In-Reply-To:
References:
Message-ID:

Congrats on joining! Well deserved.

On Thu, Mar 29, 2018 at 10:53 AM, Wes McKinney wrote:
> [...]

From jeffreback at gmail.com  Thu Mar 29 19:26:50 2018
From: jeffreback at gmail.com (Jeff Reback)
Date: Thu, 29 Mar 2018 19:26:50 -0400
Subject: [Pandas-dev] Welcome Pietro to the core team
In-Reply-To:
References:
Message-ID: <1FE05873-CF7A-4517-808B-CA7DB0CE22C0@gmail.com>

welcome to the team!
> On Mar 29, 2018, at 10:56 AM, G Young wrote:
> [...]