From garcia.marc at gmail.com  Mon Mar  5 15:55:41 2018
From: garcia.marc at gmail.com (Marc Garcia)
Date: Mon, 5 Mar 2018 20:55:41 +0000
Subject: [Pandas-dev] Open questions regarding docstrings
Message-ID:

Hi there,

There are a few things regarding the docstrings that are still open to
discussion, in many cases because the numpy convention (or the numpydoc
examples) is different from the unwritten convention used in most pandas
docstrings.

You have probably seen the discussion on GitHub, but I list the points
here, with the proposed decision (mainly to keep the pandas way). If
anyone disagrees on any point, please let us know, so we can change the
documentation before the sprint and do it in the desired way.

1) Starting the docstring just after the opening triple quotes, or on
the next line. In pandas it's more common to do it on the next line, so
we'll keep it this way.

2) For parameters, showing the default value after the type, or after
the description. NumPy does not consider it necessary to specify
defaults at all and, when they are specified, recommends placing them
after the description. The proposal (mainly by Joris) is to always
include them, placed after the type, as that is easier to see.

3) For parameters expecting a string, the numpy convention examples use
`str`; the proposal is to use `string` instead.

4) For complex types like dicts, I think there is some consensus that it
is easier to understand the types when using brackets (e.g. "dict of
{str: int}" over "dict of str: int"). The same goes for tuples (e.g.
"tuple of (int, str, int)" over "tuple of int, str, int"). For lists and
sets, the type is simpler (e.g. "list of int" or "set of str"). I
propose to use the brackets for dict and tuple, not for list and set,
and to use `str` over `string` when it is part of a complex type.

5) For cases where a parameter is optional, that is, it has a default
value of None meaning the value is not required. (As I understand it, a
case like `fillna(value=None)` would not count as optional in this
sense, since there None would mean the value used to replace `NaN`.)
Here the proposal is to describe the type as something like "int or
float, optional" over "int, float or None (default None)".

6) When the parameter expects something in the form of a Python list, a
numpy array, a pandas Series... document it as "array-like" over other
options like "iterable" or "numpy.array, Series or list".
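To make this concrete, a Parameters section following all of the above
could look something like this (just a sketch I wrote for illustration,
not something taken from the codebase):

```python
def example(sep=',', max_rows=60, mapping=None, values=None):
    """
    Short summary, starting on the line after the quotes (point 1).

    Parameters
    ----------
    sep : string, default ','
        Field delimiter (default after the type, points 2 and 3).
    max_rows : int, default 60
        Maximum number of rows to show.
    mapping : dict of {str: int}, optional
        Brackets, and `str` inside a complex type (point 4).
    values : array-like, optional
        Optional parameter defaulting to None (points 5 and 6).
    """
```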
Thanks!

From cbartak at gmail.com  Mon Mar  5 17:07:28 2018
From: cbartak at gmail.com (Chris Bartak)
Date: Mon, 5 Mar 2018 16:07:28 -0600
Subject: [Pandas-dev] Open questions regarding docstrings
In-Reply-To:
References:
Message-ID:

Hi Marc,

Thanks for pulling out this list. The only one of these that seems
potentially objectionable to me is #3 - it does seem like we're pretty
inconsistent on this currently, but in my opinion it'd be better to side
with `str` - matching the actual python type, numpy, mypy annotations,
etc.

On Mon, Mar 5, 2018 at 2:55 PM, Marc Garcia wrote:
> [...]

From tom.augspurger88 at gmail.com  Mon Mar 5 17:58:21 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Mon, 5 Mar 2018 14:58:21 -0800
Subject: [Pandas-dev] Open questions regarding docstrings
In-Reply-To:
References:
Message-ID:

Agreed with Chris about 3. In the same vein, about 4 and 6, I could see
more precision in the docstrings as an aid to adopting function
annotations and mypy in the future. Is List[int] too ugly / unusual for
readers?

Case in point, one of your examples from 6, a Python list, isn't
array-like (in the sense that is_array_like(List) is False). Documenting
exactly what we mean by array-like is probably not something we're ready
for, but I'd like to hear what others think about adopting mypy's
spelling of types where it's not too burdensome.
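For instance (only a sketch of the spelling I mean, with made-up names):

```python
from typing import Dict, List


def example(counts: List[int], mapping: Dict[str, int]) -> int:
    """
    Parameters
    ----------
    counts : List[int]
        The docstring type spelled the same way as the annotation.
    mapping : Dict[str, int]
        Instead of "dict of {str: int}".
    """
    return sum(counts) + sum(mapping.values())
```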
Tom

On Mon, Mar 5, 2018 at 2:07 PM, Chris Bartak wrote:
> [...]

From jorisvandenbossche at gmail.com  Mon Mar 5 18:05:13 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 6 Mar 2018 00:05:13 +0100
Subject: [Pandas-dev] Open questions regarding docstrings
In-Reply-To:
References:
Message-ID:

Yes, thanks for the overview.

Regarding the type descriptions, as a reference, an overview of all
currently used type descriptions can be seen here:
https://github.com/pandas-dev/pandas/pull/19704#issuecomment-369405611

From that you can see that many things are now rather inconsistent (str
vs string, optional vs default None, ...; in most cases both variants
are about equally used). So we should make choices! :)

For str vs string: I *think* "string" can be more readable and
understandable for newcomers (not sure how well known the str type is
for this user group). But of course, if taking "string" rather than
"str", we should maybe also look at "int" vs "integer", "bool" vs
"boolean", etc. I can live with either decision.

Joris

2018-03-05 23:07 GMT+01:00 Chris Bartak:
> [...]
From ml at pietrobattiston.it  Thu Mar 15 13:36:38 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 15 Mar 2018 18:36:38 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
Message-ID: <1521135398.20162.23.camel@pietrobattiston.it>

Dear pandas devs,

like most (I think) of you, I love how pandas supports chained method
calls.

And like several other users, I get frustrated when I have to break some
chained sequence of calls because a given operation cannot be included.
See for instance
https://stackoverflow.com/q/11869910/2858145
https://stackoverflow.com/q/40028500/2858145
https://stackoverflow.com/q/44912692/2858145

I ended up noticing that most of the time, the problematic operation is
a filtering, since it is typically done as

df.loc[condition_on(df)]

e.g.

df.loc[df['a'] > 3]

In R, we would do (something more similar to)

df.loc[a>3]

... but we can't in Python syntax. This is not usually a huge deal - one
could even claim that "df[df['a'] > 3]" is nicer because it's more
explicit. Still, when it's not df but rather a five-line chain of calls,
one needs to create the df and then filter it, which is annoying.

There are a couple of other solutions: df.filter, adding an ad-hoc
method to pandas objects... but I never found any of them general and/or
pythonic enough. So I tried an alternative: lazy evaluation. It took
relatively few lines of code, and after some weeks of use I'm really
satisfied with the result:
https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py
(do not bother about the rest of the repo, the file works as a
standalone module).

This allows one to replace
df.loc[df['a'] > 2]
with
df.loc[W['a'] > 2]
... and to apply virtually any operation one would apply to df (more
precisely, any operation... which is chainable). (*)

As a bonus, one can write a condition once and reuse it to filter
several pandas objects.

I'm writing this email to ask:
- whether you have in mind some alternative solution I did not consider
to the problem of "unchainable filterings"
- whether you have suggestions on how to improve my solution
- whether you think this is worth merging into pandas (the amount of
monkey patching required is so small that it is not burdensome to keep
it separate - it just means one more dependency for users who want to
use it)

For the record: it currently works only in .loc... and I don't expect
this to change: I guess pd.{Series,DataFrame}.__getitem__ already
support too many different mechanisms.

Supporting .loc as a setter should instead be pretty straightforward -
it is just lower priority, as it is not used in chaining.

Pietro

(*) The only exception (that I know of) at the moment: W.loc(axis=1)[...]
won't work, because I "taught" it that "loc" is not a callable.
Shouldn't be hard to fix.

From tom.augspurger88 at gmail.com  Thu Mar 15 13:58:23 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 15 Mar 2018 12:58:23 -0500
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521135398.20162.23.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID:

FYI, `df.loc[lambda x: x['a'] > 3]` is valid. loc takes a callable, and
evaluates it with the NDFrame as the first (only) argument.

So the downside is now that `lambda x:` is a bit more to type than `W`,
but it's not so bad. And if you have a pre-defined method for filtering,
it's `df.loc[condition_on]`, which is the shortest (but maybe not
clearest) way of spelling that.
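A quick (made-up) example showing that it chains:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# the callable receives the intermediate frame, so no temporary name
# is needed in the middle of a chain:
result = (
    df.assign(b=lambda x: x["a"] * 2)
      .loc[lambda x: x["a"] > 3]
)
print(result)
```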
- Tom

On Thu, Mar 15, 2018 at 12:36 PM, Pietro Battiston wrote:
> [...]

From cbartak at gmail.com  Thu Mar 15 14:03:18 2018
From: cbartak at gmail.com (Chris Bartak)
Date: Thu, 15 Mar 2018 13:03:18 -0500
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521135398.20162.23.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID:

If you're not aware, we do have one (IMO ugly) solution for this using
lambdas:

function_making_df().loc[lambda x: x['a'] > 3]

There is also some prior art in pandas_ply [1] and dplython [2]. I had a
WIP PR adding a version of pandas_ply to pandas [3], but never finished
it out; there was some concern about the API expansion. I was ultimately
using a lambda as the delivery mechanism.
I am in favor of the general concept, though I wonder if there is a
better long-term solution around expansion of the python language for
some kind of light macro support, and/or a fully delayed expression
system, a la ibis.

[1] - https://github.com/coursera/pandas-ply
[2] - https://github.com/dodger487/dplython
[3] - https://github.com/pandas-dev/pandas/pull/14209

On Thu, Mar 15, 2018 at 12:36 PM, Pietro Battiston wrote:
> [...]
From ml at pietrobattiston.it  Thu Mar 15 14:24:12 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 15 Mar 2018 19:24:12 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID: <1521138252.20162.25.camel@pietrobattiston.it>

On Thu, Mar 15, 2018 at 12:58 -0500, Tom Augspurger wrote:
> FYI, `df.loc[lambda x: x['a'] > 3]` is valid. loc takes a callable,
> and evaluates it with the NDFrame as the first (only) argument.

Aha! I knew one could pass callables, but I had mistakenly assumed the
mechanism was analogous to df.apply(), i.e. accepting rows/elements
rather than the NDFrame itself.

I think I like my solution better... but for sure adding it to pandas
would duplicate functionality that is already present.

Thanks (to you and Chris) for the pointer,

Pietro

From ml at pietrobattiston.it  Thu Mar 15 14:39:08 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 15 Mar 2018 19:39:08 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
Message-ID: <1521139148.20162.27.camel@pietrobattiston.it>

On Thu, Mar 15, 2018 at 13:03 -0500, Chris Bartak wrote:
> [...]
> There is also some prior art in pandas_ply [1] and dplython [2]. I had
> a WIP PR adding a version of pandas_ply to pandas [3], but never
> finished it out; there was some concern about the API expansion. I was
> ultimately using a lambda as the delivery mechanism.
>
> I am in favor of the general concept, though I wonder if there is a
> better long-term solution around expansion of the python language for
> some kind of light macro support, and/or a fully delayed expression
> system, a la ibis.
>
> [1] - https://github.com/coursera/pandas-ply
> [2] - https://github.com/dodger487/dplython
> [3] - https://github.com/pandas-dev/pandas/pull/14209

Funny... basically the same API, completely different implementation. I
guess I will steal the idea of making it callable (avoiding
monkey-patching, and supporting assign())... thanks!
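To show what I mean, here is a toy version of the mechanism (nothing
like the real module, just enough to illustrate the idea): if the
deferred object is itself callable, .loc accepts it directly.

```python
import pandas as pd


class Deferred:
    """Record operations on a placeholder, replay them on a real object."""

    def __init__(self, func=lambda obj: obj):
        self._func = func

    def __getitem__(self, key):
        return Deferred(lambda obj: self._func(obj)[key])

    def __gt__(self, other):
        return Deferred(lambda obj: self._func(obj) > other)

    def __call__(self, obj):
        return self._func(obj)


W = Deferred()

df = pd.DataFrame({"a": [1, 2, 3, 4]})
# works because .loc already accepts callables, as Tom pointed out:
print(df.loc[W["a"] > 2])
```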
Pietro

From justin.lewis at gmail.com  Thu Mar 15 15:10:02 2018
From: justin.lewis at gmail.com (Justin Lewis)
Date: Thu, 15 Mar 2018 15:10:02 -0400
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521139148.20162.27.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
Message-ID:

I might be missing the point but can you use .pipe()?

In [1]: df = pd.util.testing.makeTimeDataFrame()

In [2]: df
Out[2]:
                   A         B         C         D
2000-01-03 -0.870800  0.517496 -1.129341  1.074059
2000-01-04 -0.102295  1.811238 -2.080829 -1.145249
2000-01-05 -0.608380 -0.754805  1.196582  1.480967
2000-01-06  0.358763 -0.929273  0.190293  0.191154
2000-01-07  1.984208  0.579810 -0.369664  1.583910
...              ...       ...       ...       ...
2000-02-07  0.917228 -0.200213  0.893922 -0.960147
2000-02-08  0.490313  0.728865 -0.978162  1.028735
2000-02-09  1.415720 -0.855196  1.868628 -0.247138
2000-02-10  0.613818  0.488457 -1.042366 -1.831410
2000-02-11 -1.433825  0.062954 -0.856178 -0.273247

[30 rows x 4 columns]

In [3]: df.pipe(lambda x: x[x.A > .1])
Out[3]:
                   A         B         C         D
2000-01-06  0.358763 -0.929273  0.190293  0.191154
2000-01-07  1.984208  0.579810 -0.369664  1.583910
2000-01-10  0.872874 -1.378924  0.644806  0.988295
2000-01-11  0.252953 -0.181655  0.049428  0.545417
2000-01-13  0.602725 -0.221286 -0.208824 -0.913126
...              ...       ...       ...       ...
2000-02-04  0.319361 -0.664777 -0.460101  0.111564
2000-02-07  0.917228 -0.200213  0.893922 -0.960147
2000-02-08  0.490313  0.728865 -0.978162  1.028735
2000-02-09  1.415720 -0.855196  1.868628 -0.247138
2000-02-10  0.613818  0.488457 -1.042366 -1.831410

[17 rows x 4 columns]

In [4]: df.pipe(lambda x: x[x.A > .1]).pipe(lambda x: x[x.B > .1])
Out[4]:
                   A         B         C         D
2000-01-07  1.984208  0.579810 -0.369664  1.583910
2000-01-25  0.724618  2.134328  0.269921  1.633488
2000-01-26  1.011798  0.989021 -1.472997  0.849001
2000-02-02  0.300020  0.490800  1.786019  1.389062
2000-02-03  0.729878  0.341635 -0.972437 -0.670142
2000-02-08  0.490313  0.728865 -0.978162  1.028735
2000-02-10  0.613818  0.488457 -1.042366 -1.831410

In [5]:

On Thu, Mar 15, 2018 at 2:39 PM, Pietro Battiston wrote:
> [...]

From ml at pietrobattiston.it  Thu Mar 22 10:35:31 2018
From: ml at pietrobattiston.it (Pietro Battiston)
Date: Thu, 22 Mar 2018 15:35:31 +0100
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
Message-ID: <1521729331.25305.87.camel@pietrobattiston.it>

On Thu, Mar 15, 2018 at 15:10 -0400, Justin Lewis wrote:
> I might be missing the point but can you use .pipe()?

Indeed, this is something else I had not considered.

However, I don't like it too much. Compare

.loc[W]

with

.pipe(lambda df : df[df])

By the way,

.loc[lambda df : df[df]]

is equivalent but cleaner to me (after all, we are selecting).

This said, the solutions proposed by you and Chris are indeed more
robust than mine. For instance,

.loc[W + 1 > 2]

works, but

.loc[2 < 1 + W]

doesn't, and I don't even know if a fix is possible.

Pietro

From cpcloud at gmail.com  Thu Mar 22 11:24:11 2018
From: cpcloud at gmail.com (Phillip Cloud)
Date: Thu, 22 Mar 2018 15:24:11 +0000
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To: <1521729331.25305.87.camel@pietrobattiston.it>
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
 <1521729331.25305.87.camel@pietrobattiston.it>
Message-ID:

If you feel like being evil, you can use a so-called "frame hack"
+ a context manager:

In [1]: import sys
   ...:
   ...: import numpy as np
   ...: import pandas as pd
   ...:
   ...:
   ...: class ctx:
   ...:     def __init__(self, df):
   ...:         self.df = df
   ...:         current_frame = sys._getframe(0)
   ...:         self.locals = current_frame.f_back.f_locals
   ...:         self.existing_values = {
   ...:             k: self.locals[k] for k in df.columns
   ...:             if k in self.locals
   ...:         }
   ...:         self.new_values = {k for k in df.columns if k not in self.locals}
   ...:
   ...:     def __enter__(self):
   ...:         for k in self.df.columns:
   ...:             self.locals[k] = self.df[k]
   ...:         return
   ...:
   ...:     def __exit__(self, *exc):
   ...:         self.locals.update(self.existing_values)
   ...:         for k in self.new_values:
   ...:             del self.locals[k]
   ...:

In [2]: df = pd.DataFrame({'a': np.array([1, 2], dtype='float32')})

In [3]: try:
   ...:     a + 1
   ...: except NameError:
   ...:     print("'a' doesn't exist yet!")
   ...:
'a' doesn't exist yet!

In [4]: with ctx(df):
   ...:     print(df[a == 1])
   ...:
     a
0  1.0

In [5]: try:
   ...:     a + 1
   ...: except NameError:
   ...:     print("'a' doesn't exist yet!")
   ...:
'a' doesn't exist yet!

On Thu, Mar 22, 2018 at 10:35 AM Pietro Battiston wrote:
> [...]

From tom.augspurger88 at gmail.com  Thu Mar 22 11:27:52 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 22 Mar 2018 10:27:52 -0500
Subject: [Pandas-dev] Chained filtering with lazy evaluation ("where")
In-Reply-To:
References: <1521135398.20162.23.camel@pietrobattiston.it>
 <1521139148.20162.27.camel@pietrobattiston.it>
 <1521729331.25305.87.camel@pietrobattiston.it>
Message-ID:

Sounds like someone's been learning from David Beazley :)

Now just define DataFrame.__enter__ to pass `self` to `ctx`, and write
it as

```
with df:
    print(df[a == 1])
```
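Roughly like this, maybe (a completely untested sketch; it does the
frame hack inside __enter__ itself, and like Phillip's version it only
really works at the interactive top level, where f_locals is writable):

```python
import sys

import pandas as pd


class ContextFrame(pd.DataFrame):
    """A DataFrame whose columns become names in the caller's scope."""

    # allow our bookkeeping attributes on the subclass
    _metadata = ['_scope', '_saved', '_added']

    @property
    def _constructor(self):
        return ContextFrame

    def __enter__(self):
        self._scope = sys._getframe(1).f_locals  # the caller's namespace
        self._saved = {k: self._scope[k] for k in self.columns
                       if k in self._scope}
        self._added = [k for k in self.columns if k not in self._scope]
        for k in self.columns:
            self._scope[k] = self[k]
        return self

    def __exit__(self, *exc):
        self._scope.update(self._saved)
        for k in self._added:
            del self._scope[k]


df = ContextFrame({'a': [1.0, 2.0]})
with df:
    print(df[a == 1])  # 'a' is injected into this scope by __enter__
```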
On Thu, Mar 22, 2018 at 10:24 AM, Phillip Cloud wrote:
> [...]

From leportella at gmail.com  Fri Mar 23 09:21:42 2018
From: leportella at gmail.com (Leticia Portella)
Date: Fri, 23 Mar 2018 10:21:42 -0300
Subject: [Pandas-dev] Docs translation
Message-ID:

Hello!

I am wondering if there is any effort being made nowadays to translate
the pandas documentation to new languages (in my case, Brazilian
Portuguese).

If there is, can you tell me where this is being done, so I can help?

If not, can you help me start a project on Transifex or some other tool
to make this possible?

Thank you for your attention.

Kind regards,

Leticia

-- 
Letícia Portella
leportella.com
podcast.datascience.pizza

From tom.augspurger88 at gmail.com  Fri Mar 23 09:50:59 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 23 Mar 2018 08:50:59 -0500
Subject: [Pandas-dev] Docs translation
In-Reply-To:
References:
Message-ID:

IIRC, someone started on a Korean translation a while back.

Sphinx has some support for localization:
http://www.sphinx-doc.org/en/stable/intl.html, though I have no
experience with it.

I would recommend seeing if you can adjust pandas' sphinx configuration
to use that, translate a small file, and then make a PR with what you
found. Then there's the question of ensuring translations stay up to
date, but perhaps we can worry about that later.
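From a quick look at those docs, the setup would be roughly (untested,
so take it as a sketch; the paths just follow pandas' doc layout):

```python
# in doc/source/conf.py
locale_dirs = ['locale/']
gettext_compact = False
```

and then something like

```
sphinx-build -b gettext doc/source doc/build/gettext
sphinx-intl update -p doc/build/gettext -l pt_BR
```

to extract the messages and create the .po files to translate
(sphinx-intl is a separate package).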
On Fri, Mar 23, 2018 at 8:21 AM, Leticia Portella wrote:
> [...]

From jorisvandenbossche at gmail.com  Wed Mar 28 06:16:24 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 28 Mar 2018 12:16:24 +0200
Subject: [Pandas-dev] [pydata] Proposal to change the default number of
 rows for DataFrame display (lower max_rows)
In-Reply-To:
References:
Message-ID:

Coming back to this (we are again discussing a concrete PR proposing
this change: https://github.com/pandas-dev/pandas/pull/20514)

2017-12-08 16:11 GMT+01:00 Tom Augspurger:
> On Fri, Dec 8, 2017 at 8:54 AM, Joris Van den Bossche wrote:
>> [Note for those reading it on the pydata mailing list, please answer
>> to pandas-dev at python.org to keep discussion centralised there]
>>
>> Hi all,
>>
>> I am reposting the mail of Clemens below, but with slightly changed
>> focus, as I think the main discussion point is about the number of
>> rows.
>>
>> The proposal in https://github.com/pandas-dev/pandas/pull/17023 is to
>> lower the default number of rows shown when displaying a Series or
>> DataFrame from 60 to 20.
>> Thoughts on that?
>
> Personally, I always set the max rows to 10 or 20, so I'd be OK with
> it if the community is on board.

I also often set this to a lower value like that (e.g. typically for
tutorials), so I am also in favor of changing *something*.

However, my main 'problem' is that, in interactive usage, with a lower
default it becomes very cumbersome to actually look at more data
(changing the setting just to inspect some data). For example, if the
new max_rows default were 10, doing df.head(20) to quickly inspect some
more data would still only show 10 rows.

We cannot change what a function like head does (its result is still a
normal repr following the same options, since it needs to actually
return a dataframe, not only display it), but therefore I have another
proposal, sketched after this list:

- We have two thresholds instead of one (the current 'max_rows'): a
number of rows to show *in* a truncated repr, and a max number of rows
to show without truncating.
- For 'big' dataframes, we show a truncated repr. And there I would go
even lower than 20 and only show the first/last 5 (so like a max_rows
of 10).
- For 'small' dataframes, we show the full dataframe without
truncating, up to the threshold.

Of course, the difficulty is then to determine what we call 'big' and
'small', i.e. what is the threshold to show a truncated repr (and this
part will again get more subjective :)). But for example, using the
current max_rows of 60: we could show a full repr up to 60 rows, and
once the number of rows > 60, we only show 10 (first/last 5).

You can then still set both thresholds to the same number (like 20) to
not get this variable behaviour.

This is actually similar to what numpy arrays do (but with a bigger
threshold: e.g. np.random.randn(1000) shows all 1000 elements,
np.random.randn(1001) shows the first/last 3).

It's just an idea, but I think this might be a way to satisfy more use
cases at once (and more possibility to fine-tune the behaviour).
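In pseudo-code, the repr would then decide something like this (names
invented for the example):

```python
def rows_to_show(n_rows, print_min=10, print_max=60):
    # 'small' frame: show everything, no truncation
    if n_rows <= print_max:
        return n_rows
    # 'big' frame: truncated repr, first/last print_min // 2 rows
    return print_min

assert rows_to_show(60) == 60   # full repr
assert rows_to_show(61) == 10   # truncated to 10 once over the threshold
assert rows_to_show(61, print_min=20, print_max=20) == 20
```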
Joris

From jorisvandenbossche at gmail.com  Wed Mar 28 07:44:30 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 28 Mar 2018 13:44:30 +0200
Subject: [Pandas-dev] [pydata] Proposal to change the default number of
 rows for DataFrame display (lower max_rows)
In-Reply-To:
References:
Message-ID:

2018-03-28 12:16 GMT+02:00 Joris Van den Bossche:
> [...]
> This is actually similar to what numpy arrays do (but with a bigger
> threshold: e.g. np.random.randn(1000) shows all 1000 elements,
> np.random.randn(1001) shows the first/last 3).

And it seems this is also what R tibbles do: they have "print_min" and
"print_max" options with exactly this behaviour, only their defaults are
lower (print_min is 10 and print_max is 20):

> options(tibble.print_max = n, tibble.print_min = m): if there are more
> than n rows, print only the first m rows. Use
> options(tibble.print_max = Inf) to always show all rows.

(from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
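If we copied that, usage on our side could look something like this (the
second option name is purely hypothetical at this point):

```python
import pandas as pd

pd.set_option("display.max_rows", 60)  # existing option: full repr up to here
pd.set_option("display.min_rows", 10)  # hypothetical: rows shown once truncated
```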
Joris

From jorisvandenbossche at gmail.com  Thu Mar 29 09:29:49 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 29 Mar 2018 15:29:49 +0200
Subject: [Pandas-dev] Welcome Pietro to the core team
Message-ID:

Hi all,

On behalf of the core developers I'd like to welcome Pietro Battiston
(@toobaz) to the core dev team.

Pietro has been active on a regular basis for the last two years, and
has been especially closely involved in discussions regarding
MultiIndexing, where he did some nice work.

Thanks for all those contributions, and looking forward to further
collaboration and your continued involvement!

Joris

From wesmckinn at gmail.com  Thu Mar 29 10:53:27 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 29 Mar 2018 10:53:27 -0400
Subject: [Pandas-dev] Welcome Pietro to the core team
In-Reply-To:
References:
Message-ID:

Thanks Pietro, and welcome!

On Thu, Mar 29, 2018 at 9:29 AM, Joris Van den Bossche wrote:
> [...]

From gfyoung17 at gmail.com  Thu Mar 29 10:56:19 2018
From: gfyoung17 at gmail.com (G Young)
Date: Thu, 29 Mar 2018 10:56:19 -0400
Subject: [Pandas-dev] Welcome Pietro to the core team
In-Reply-To:
References:
Message-ID:

Congrats on joining! Well deserved.

On Thu, Mar 29, 2018 at 10:53 AM, Wes McKinney wrote:
> [...]

From jeffreback at gmail.com  Thu Mar 29 19:26:50 2018
From: jeffreback at gmail.com (Jeff Reback)
Date: Thu, 29 Mar 2018 19:26:50 -0400
Subject: [Pandas-dev] Welcome Pietro to the core team
In-Reply-To:
References:
Message-ID: <1FE05873-CF7A-4517-808B-CA7DB0CE22C0@gmail.com>

welcome to the team!
> On Mar 29, 2018, at 10:56 AM, G Young wrote:
> [...]