From David.Rashty at flagstar.com  Thu Sep 13 19:31:55 2018
From: David.Rashty at flagstar.com (David M Rashty)
Date: Thu, 13 Sep 2018 23:31:55 +0000
Subject: [Pandas-dev] pandas or new project
Message-ID: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>

Dear pandas team,
I am a long-time Stata user and started using pandas about a year ago to build web applications on top of an in-memory dataframe structure.  As a business user, I've found Stata to have a key advantage over pandas that many others have also noted: much faster development time.  Examples in Stata:

drop myvar*       // drops all columns starting with myvar
keep myvar*       // drops all columns except those starting with myvar
reg z y x               // runs the regression z = a+bx+cy + error

To use pandas in a Stata-like fashion, I've had to monkey patch large parts of the library, e.g.:

df = df.sdrop('myvar*')     # same as Stata's "drop myvar*"
df = df.skeep('myvar*')     # same as Stata's "keep myvar*"
df = df.sreg('z y x')       # same as Stata's "reg z y x"
df = df.squery('a>80 & b.str.contains("hello") & c.isin([1,2,3])')   # df.query doesn't support str.contains and isin to my knowledge

I put an "s" in front of my methods to mean either "stata" or "sugar".
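
For reference, `DataFrame.query` can evaluate method calls such as `str.contains` and `isin` when it is run with `engine="python"`. The following is a minimal sketch on made-up data; the column names `a`, `b` and `c` come from the example above, while the values are purely illustrative.

```python
import pandas as pd

# Illustrative data only; the values are not from the original message.
df = pd.DataFrame(
    {
        "a": [50, 90, 95, 99],
        "b": ["hello a", "goodbye", "say hello", "hello"],
        "c": [1, 2, 3, 9],
    }
)

# engine="python" lets query() evaluate method calls on columns.
result = df.query(
    'a > 80 and b.str.contains("hello") and c.isin([1, 2, 3])',
    engine="python",
)
print(result)
```

Only the row that satisfies all three conditions is returned. The trade-off is that the Python engine skips the optional numexpr acceleration, which mostly matters for large frames.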

Additionally, I've built:

a)      A system to automatically load new DataFrame methods into memory (no additional imports required)

b)      A caching system that makes loading data blazing fast, along with a much tighter syntax, e.g., pd.read_stata('mydata.dta') (6 secs load time) vs use.mydata (0.001 secs load time after the first read from file)

c)      A system of column "labels" and formats to prettify various reports, e.g., df.sscatter('rate score') produces a scatter plot with labels "Interest Rate, %" and "Credit Score", respectively.

d)      A reactive web app (using Flask/Redis) to quickly view the full DataFrame content in a browser:
[Attached image: image001.jpg]
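
The caching idea in item (b) could be sketched roughly as follows. This is an assumption about one possible design, not the author's implementation: `cached_load` and `_cache` are hypothetical names, and the cache is a plain in-process dictionary keyed by file path and invalidated when the file's modification time changes.

```python
import os

# Hypothetical in-memory cache: path -> (modification time, loaded object).
_cache = {}


def cached_load(path, loader):
    """Return loader(path), reusing the previous result while the file is unchanged."""
    mtime = os.path.getmtime(path)
    hit = _cache.get(path)
    if hit is not None and hit[0] == mtime:
        return hit[1]  # cache hit: skip the expensive read
    data = loader(path)
    _cache[path] = (mtime, data)
    return data
```

With pandas this would be called as `cached_load('mydata.dta', pd.read_stata)`. Note that a cache hit returns the same object that was stored, so callers that mutate the DataFrame in place would also change the cached copy; returning `data.copy()` is the safer but slower choice.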

Basically, I've tried to eliminate any obvious advantages Stata has over pandas.

I'm potentially interested in developing this project into something bigger.   Would you like me to share my work in the context of pandas or should it be a completely separate project with a different scope?

Thanks,

David Rashty | Flagstar Bank | Whole Loan Trading | 248-312-6692 | david.rashty at flagstar.com

This e-mail may contain data that is confidential, proprietary or non-public personal information, as that term is defined in the Gramm-Leach-Bliley Act (collectively, Confidential Information). The Confidential Information is disclosed conditioned upon your agreement that you will treat it confidentially and in accordance with applicable law, ensure that such data isn't used or disclosed except for the limited purpose for which it's being provided and will notify and cooperate with us regarding any requested or unauthorized disclosure or use of any Confidential Information. 
By accepting and reviewing the Confidential information, you agree to indemnify us against any losses or expenses, including attorney's fees that we may incur as a result of any unauthorized use or disclosure of this data due to your acts or omissions. If a party other than the intended recipient receives this e-mail, he or she is requested to instantly notify us of the erroneous delivery and return to us all data so delivered.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 179861 bytes
Desc: image001.jpg
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20180913/efa3f7d5/attachment-0001.jpg>

From tom.augspurger88 at gmail.com  Thu Sep 13 21:40:34 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 13 Sep 2018 20:40:34 -0500
Subject: [Pandas-dev] pandas or new project
In-Reply-To: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
References: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
Message-ID: <CAE1aY-m0feLwO-zYSs5EiD1BYuT+pe2Mmn1XxfH6yDai5jt2Ng@mail.gmail.com>

With respect to your `sdrop` and `skeep`, that's the goal of
DataFrame.filter, though the name isn't the best, so it may
be deprecated in favor of something better.
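
As a minimal sketch of what this looks like in practice, `DataFrame.filter` can select columns by regular expression, and the complement can be dropped with `DataFrame.drop`. The frame and its column names below are made up for illustration.

```python
import pandas as pd

# Illustrative frame; the column names are assumptions for the example.
df = pd.DataFrame({"myvar1": [1, 2], "myvar2": [3, 4], "other": [5, 6]})

# Roughly Stata's `keep myvar*`: keep columns whose name starts with "myvar".
kept = df.filter(regex=r"^myvar")

# Roughly Stata's `drop myvar*`: drop exactly those columns instead.
dropped = df.drop(columns=kept.columns)

print(list(kept.columns))
print(list(dropped.columns))
```

`filter` also accepts `like="myvar"` for a plain substring match, which is looser than the anchored regex because it matches anywhere in the column name.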

The rest sounds interesting, but is likely out of scope for pandas. If you
build an open source library, then we'd be
happy to include it on pandas' ecosystem page:
http://pandas.pydata.org/pandas-docs/stable/ecosystem.html

Tom



From wesmckinn at gmail.com  Thu Sep 13 21:55:56 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 13 Sep 2018 21:55:56 -0400
Subject: [Pandas-dev] pandas or new project
In-Reply-To: <CAE1aY-m0feLwO-zYSs5EiD1BYuT+pe2Mmn1XxfH6yDai5jt2Ng@mail.gmail.com>
References: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
 <CAE1aY-m0feLwO-zYSs5EiD1BYuT+pe2Mmn1XxfH6yDai5jt2Ng@mail.gmail.com>
Message-ID: <CAJPUwMDzZ4i6SEBsSB+swH6ZLb5ynp+fY+qh2su7qgW6wM37ug@mail.gmail.com>

hi David,

There's nothing really wrong with injecting a bunch of custom methods into
the DataFrame.* namespace. If you wanted, you could release your package as
something like

import pandas_stata

and then the new methods would be available. This is pretty common in large
corporate environments that use pandas AFAICT. You can also propose your
changes in pull requests to pandas.
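
A minimal sketch of the two usual ways a package could make such methods available on import is shown below. The function `skeep`, the accessor name `stata` and the class `StataAccessor` are hypothetical; `pd.api.extensions.register_dataframe_accessor` is the pandas extension hook for namespaced accessors.

```python
import pandas as pd


def skeep(df, prefix):
    """Keep only the columns whose name starts with `prefix` (hypothetical helper)."""
    mask = [str(col).startswith(prefix) for col in df.columns]
    return df.loc[:, mask]


# Option 1: plain monkey patching, so that df.skeep(...) works after import.
pd.DataFrame.skeep = skeep


# Option 2: a namespaced accessor, so that df.stata.keep(...) works after import.
@pd.api.extensions.register_dataframe_accessor("stata")
class StataAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def keep(self, prefix):
        return skeep(self._df, prefix)


df = pd.DataFrame({"myvar1": [1], "myvar2": [2], "other": [3]})
print(list(df.skeep("myvar").columns))
print(list(df.stata.keep("myvar").columns))
```

Putting this code in a module's top level means that importing the package is enough to register the methods. The accessor keeps the added methods under one attribute, which lowers the risk of name clashes with future pandas methods.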

- Wes



From David.Rashty at flagstar.com  Thu Sep 13 22:16:21 2018
From: David.Rashty at flagstar.com (David M Rashty)
Date: Fri, 14 Sep 2018 02:16:21 +0000
Subject: [Pandas-dev] [EXTERNAL] Re:  pandas or new project
In-Reply-To: <CAJPUwMDzZ4i6SEBsSB+swH6ZLb5ynp+fY+qh2su7qgW6wM37ug@mail.gmail.com>
References: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
 <CAE1aY-m0feLwO-zYSs5EiD1BYuT+pe2Mmn1XxfH6yDai5jt2Ng@mail.gmail.com>
 <CAJPUwMDzZ4i6SEBsSB+swH6ZLb5ynp+fY+qh2su7qgW6wM37ug@mail.gmail.com>
Message-ID: <032e7016a7df42e49f5cc40181f3f775@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>

Got it.  Thanks for the quick response!

Just for the record, I'm aware of filter; I just found it lacking.  The critical aspect of the example is the parameter 'myvar*', which is equivalent to the regex '^myvar.*'.

However, Stata has a special syntax to quickly reference variables in a way that is more intuitive for non-nerds and very effective for exploratory data analysis.  For example, skeep('san* los_ang california-michigan') means filter on:

-        All columns that start with 'san', e.g., san_francisco, san_jose, etc.

-        A unique column starting with los_ang (raises an error if it isn't unique or there isn't a match); note: this isn't something you'd want to do in application development, but it is very helpful in exploratory data analysis.

-        All columns between california and michigan, e.g., if california is column 5 and michigan is column 10, it will keep columns 5, 6, 7, 8, 9 and 10.
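
The three rules above can be sketched as a small resolver over a list of column names. This is an illustrative reading of the described syntax, not the author's code: `resolve_varlist` is a hypothetical name, and the sketch assumes that column names themselves contain no hyphens or wildcard characters.

```python
from fnmatch import fnmatchcase


def resolve_varlist(columns, spec):
    """Expand a Stata-like variable list into concrete column names."""
    columns = list(columns)
    selected = []
    for token in spec.split():
        if "*" in token or "?" in token:
            # Wildcard: every column matching the pattern.
            matches = [c for c in columns if fnmatchcase(c, token)]
            if not matches:
                raise KeyError(f"no column matches {token!r}")
        elif "-" in token:
            # Range: all columns positioned between the two names, inclusive.
            start, end = token.split("-", 1)
            i, j = columns.index(start), columns.index(end)
            if i > j:
                raise ValueError(f"{start!r} comes after {end!r}")
            matches = columns[i : j + 1]
        elif token in columns:
            matches = [token]
        else:
            # Abbreviation: must identify exactly one column.
            matches = [c for c in columns if c.startswith(token)]
            if len(matches) != 1:
                raise KeyError(f"{token!r} matches {len(matches)} columns")
        for col in matches:
            if col not in selected:
                selected.append(col)
    return selected


cols = ["id", "san_francisco", "san_jose", "los_angeles", "x",
        "california", "nevada", "oregon", "michigan", "texas"]
print(resolve_varlist(cols, "san* los_ang california-michigan"))
```

With a DataFrame, the result can be used directly for selection, as in `df[resolve_varlist(df.columns, 'san* los_ang california-michigan')]`.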

Imagine trying to get a business person to implement something like this in pandas... hence people going back to tools like Stata, SAS, JMP, etc., which all have tools to bridge the gap between human thought and machine execution.



From jorisvandenbossche at gmail.com  Mon Sep 24 18:27:34 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 25 Sep 2018 00:27:34 +0200
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at
 17:00 UTC
Message-ID: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>

Hi all,

We're having a dev chat coming Thursday (September 27) at 9:00 Eastern /
14:00 UTC+1 / 15:00 CEST (Europe).
All are welcome to attend.

Hangout:
https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0
<https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.7ft88t77r0mf11ad2kaqnm0ru6?authuser=0>

Calendar invite:
https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com
<https://calendar.google.com/event?action=TEMPLATE&tmeid=N2Z0ODh0NzdyMG1mMTFhZDJrYXFubTBydTYgam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com>

Agenda/Minutes:
https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing

Joris

From jorisvandenbossche at gmail.com  Tue Sep 25 02:38:15 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 25 Sep 2018 08:38:15 +0200
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27
 at 17:00 UTC
In-Reply-To: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
References: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
Message-ID: <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>

Correction to my previous mail: the title was correct about 17:00 UTC, but
of course this then corresponds to 10:00 Pacific / 13:00 Eastern / 18:00
UTC+1 / 19:00 CEST (Europe).

Joris



From jorisvandenbossche at gmail.com  Tue Sep 25 13:41:00 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 25 Sep 2018 19:41:00 +0200
Subject: [Pandas-dev] Datetime (with timezone?) as extension array?
In-Reply-To: <CAKf8g9TnJVqkjUQiCsmT4-M=g-xqV0osd4A2MiPOf1ZheOvyrQ@mail.gmail.com>
References: <1534237179.2549.26.camel@pietrobattiston.it>
 <CAE1aY-k_-XyK5OBzk0h-cF1rwsWNv7WopM6JA2uLZoUWQwbhNQ@mail.gmail.com>
 <CAKf8g9TnJVqkjUQiCsmT4-M=g-xqV0osd4A2MiPOf1ZheOvyrQ@mail.gmail.com>
Message-ID: <CALQtMBYbtvnMY10g2=k_ErT8S243RjN0L58GyQDOGNdunxG7ZA@mail.gmail.com>

2018-08-14 17:32 GMT+02:00 Brock Mendel <jbrockmendel at gmail.com>:

> `DatetimeArray` is close to ready if you want to bring it over the finish
> line.  Pretty much all that has to be done is having `DatetimeArrayMixin`
> subclass `ExtensionArray` (and, uh, implement the relevant EA methods).  If
> no one else picks this up, my current plan is to do this _after_ updating
> all of the relevant arithmetic tests to test DatetimeArrayMixin.

What's the status of this?  Asking because I think having a working EA
DatetimeArray implementation is important for a 0.24.0 release, and I can
imagine it will still take quite some discussion and would be good to have
it in master for a while.

It's hard to really steer this since it is volunteer-based (and certainly
because I currently don't have the time to do it myself), but to the extent
possible, it would be good if we could try to prioritize it a bit.

Joris


> > The unclear part is what `Series[datetime_with_tz].values` should be.
>
> I thought the conclusion was that `.values` should be non-lossy, in which
> case it would have to be the EA.  My preference would be for the EA to be
> returned for non-tz datetime64[ns] Series too.
>
> For that matter, I'd like it if `Series.values` _always_ returned an EA,
> but we're not there yet.
>
>
> On Tue, Aug 14, 2018 at 4:13 AM, Tom Augspurger <
> tom.augspurger88 at gmail.com> wrote:
>
>> The discussion on datetime with timezone has been a bit scattered. I
>> don't think there's a single issue with everyone's thoughts.
>>
>> There will be a DatetimeWithTZ array that implements the EA interface.
>> Anywhere we're internally using a DatetimeIndex as a
>> container for datetimes with timezones will use the new EA.
>>
>> The unclear part is what `Series[datetime_with_tz].values` should be.
>> Currently, we convert to UTC, strip the timezone, and return
>> a datetime64[ns] ndarray. Changing that would be disruptive, jarringly
>> different from `Series[datetime].values` (no tz) and of little
>> value I think.
>>
>> Tom
>>
>> On Tue, Aug 14, 2018 at 4:07 AM Pietro Battiston <me at pietrobattiston.it>
>> wrote:
>>
>>> Hi all,
>>>
>>> I assumed that Datetime (with timezone, or maybe in general?) was also
>>> planned to follow the extension array interface, which is related to
>>> issue https://github.com/pandas-dev/pandas/issues/19041 , to the
>>> annoying fact that datetimeindexwithtz._values returns the index
>>> itself, and also to the fact that
>>> https://pandas.pydata.org/pandas-docs/stable/extending.html
>>> currently states "Pandas itself uses the extension system for some
>>> types that aren't built into NumPy (categorical, period, interval,
>>> datetime with timezone).", which is false.
>>>
>>> ... but I didn't find an issue for this? Did I miss it? Should I create
>>> it? Or was there a decision to leave datetimeindextz as it is, maybe
>>> for better compatibility with numpy?
>>>
>>> Pietro

From tom.augspurger88 at gmail.com  Wed Sep 26 07:06:48 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Wed, 26 Sep 2018 06:06:48 -0500
Subject: [Pandas-dev] Datetime (with timezone?) as extension array?
In-Reply-To: <CALQtMBYbtvnMY10g2=k_ErT8S243RjN0L58GyQDOGNdunxG7ZA@mail.gmail.com>
References: <1534237179.2549.26.camel@pietrobattiston.it>
 <CAE1aY-k_-XyK5OBzk0h-cF1rwsWNv7WopM6JA2uLZoUWQwbhNQ@mail.gmail.com>
 <CAKf8g9TnJVqkjUQiCsmT4-M=g-xqV0osd4A2MiPOf1ZheOvyrQ@mail.gmail.com>
 <CALQtMBYbtvnMY10g2=k_ErT8S243RjN0L58GyQDOGNdunxG7ZA@mail.gmail.com>
Message-ID: <CAE1aY-kVmOgA3W4pON=zUayb3=+b_GeYhYHYrzXvoGjAptKyAA@mail.gmail.com>

Agreed that all our internal extension arrays should be real EAs for the
0.24.0 release.

My plan has been SparseArray, then PeriodArray, then DatetimeTZArray. I'll
see what I can prototype before the meeting tomorrow.

Tom

On Tue, Sep 25, 2018 at 12:41 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> 2018-08-14 17:32 GMT+02:00 Brock Mendel <jbrockmendel at gmail.com>:
>
>> `DatetimeArray` is close to ready if you want to bring it over the finish
>> line.  Pretty much all that has to be done is having `DatetimeArrayMixin`
>> subclass `ExtensionArray` (and, uh, implement the relevant EA methods).  If
>> no one else picks this up, my current plan is to do this _after_ updating
>> all of the relevant arithmetic tests to test DatetimeArrayMixin.
>>
>> What's the status of this?  Asking because I think having a working EA
> DatetimeArray implementation is important for a 0.24.0 release, and I can
> imagine it will still take quite some discussion and would be good to have
> it in master for a while.
>
> It's a hard to really steer this since it is volunteer based (and
> certainly because I currently don't have the time to do it myself), but to
> the extent possible, it would be good if we could try to prioritize it a
> bit.
>
> Joris
>
>
>
>> > The unclear part is what `Series[datetime_with_tz].values` should be.
>>
>> I thought the conclusion was that `.values` should be non-lossy, in which
>> case it would have to be the EA.  My preference would be for the EA to be
>> returned for non-tz datetime64[ns] Series too.
>>
>> For that matter, I'd like it if `Series.values` _always_ returned an EA,
>> but we're not there yet.
>>
>>
>> On Tue, Aug 14, 2018 at 4:13 AM, Tom Augspurger <
>> tom.augspurger88 at gmail.com> wrote:
>>
>>> The discussion on datetime with timezone has been a bit scattered. I
>>> don't think there's a single issue with everyone's thoughts.
>>>
>>> There will be a DatetimeWithTZ array that implements the EA interface.
>>> Anywhere we're internally using a DatetimeIndex as a
>>> container for datetimes with timezones will use the new EA.
>>>
>>> The unclear part is what `Series[datetime_with_tz].values` should be.
>>> Currently, we convert to UTC, strip the timezone, and return
>>> a datetime64[ns] ndarray. Changing that would be disruptive, jarringly
>>> different from `Series[datetime].values` (no tz) and of little
>>> value I think.
>>>
>>> Tom
>>>
>>> On Tue, Aug 14, 2018 at 4:07 AM Pietro Battiston <me at pietrobattiston.it>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I assumed that Datetime (with timezone, or maybe in general?) was also
>>>> planned to follow the extension array interface, which is related to
>>>> issue https://github.com/pandas-dev/pandas/issues/19041 , to the
>>>> annoying fact that datetimeindexwithtz._values returns the index
>>>> itself, and also to the fact that
>>>> https://pandas.pydata.org/pandas-docs/stable/extending.html
>>>> currently states "Pandas itself uses the extension system for some
>>>> types that aren't built into NumPy (categorical, period, interval,
>>>> datetime with timezone).", which is false.
>>>>
>>>> ... but I didn't find an issue for this? Did I miss it? Should I create
>>>> it? Or was there a decision to leave datetimeindextz as it is, maybe
>>>> for better compatibility with numpy?
>>>>
>>>> Pietro
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>
>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>>>
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>>
>

From tom.augspurger88 at gmail.com  Wed Sep 26 10:43:50 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Wed, 26 Sep 2018 09:43:50 -0500
Subject: [Pandas-dev] Future Deprecation Policy
Message-ID: <CAE1aY-k2c7_ioxuaFrPZ1yV3v5_9i_ogE=BgAiYVMyJfPH3MAA@mail.gmail.com>

Hi all,

At the sprint, we touched on this, but I don't recall there being a whole
lot of discussion. I wanted to confirm that we're on the same page.

Briefly, I see two options:

1. SemVer
   Deprecations are introduced as needed. Enforcing a deprecation is a
   backwards-incompatible change, and so is restricted to major releases
   only.

2. Rolling deprecations
   Deprecations are introduced as needed. Deprecations are enforced N
   releases after they were introduced (N typically being 2-3 "major"
   releases in practice).
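
To make the difference between the two schemes concrete, here is a minimal sketch of how each one would decide whether a deprecation may be enforced in a given release. The `Version` tuple, the function names, and the default `n=3` are illustrative assumptions for this example only; none of them is pandas API or agreed policy.

```python
from typing import NamedTuple, Sequence


class Version(NamedTuple):
    """Hypothetical (major, minor) release number, e.g. Version(1, 2) for 1.2."""
    major: int
    minor: int


def enforceable_semver(deprecated_in: Version, release: Version) -> bool:
    # SemVer: enforcing a deprecation is a breaking change, so it may only
    # land in the first release (X.0) of a later major series.
    return release.major > deprecated_in.major and release.minor == 0


def enforceable_rolling(deprecated_in: Version, release: Version,
                        history: Sequence[Version], n: int = 3) -> bool:
    # Rolling: enforcement may land in any release that comes at least
    # n releases after the one that introduced the deprecation.
    gap = history.index(release) - history.index(deprecated_in)
    return gap >= n


# Assumed release history, oldest first.
history = [Version(1, 0), Version(1, 1), Version(1, 2),
           Version(1, 3), Version(2, 0)]
deprecated_in = Version(1, 0)

semver_in_1_3 = enforceable_semver(deprecated_in, Version(1, 3))
semver_in_2_0 = enforceable_semver(deprecated_in, Version(2, 0))
rolling_in_1_2 = enforceable_rolling(deprecated_in, Version(1, 2), history)
rolling_in_1_3 = enforceable_rolling(deprecated_in, Version(1, 3), history)
```

Under these assumptions, a deprecation introduced in 1.0 could be enforced in 1.3 under the rolling scheme, but would have to wait for 2.0 under SemVer.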

Do people have a preference between these two schemes? Does the fact that
NumPy uses a rolling deprecation policy swing us one way or the other?

I've added this to the agenda for the meeting tomorrow, but wanted to give
people a chance to collect their thoughts first.

Tom

From jbrockmendel at gmail.com  Wed Sep 26 16:07:35 2018
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Wed, 26 Sep 2018 13:07:35 -0700
Subject: [Pandas-dev] Datetime (with timezone?) as extension array?
In-Reply-To: <CAE1aY-kVmOgA3W4pON=zUayb3=+b_GeYhYHYrzXvoGjAptKyAA@mail.gmail.com>
References: <1534237179.2549.26.camel@pietrobattiston.it>
 <CAE1aY-k_-XyK5OBzk0h-cF1rwsWNv7WopM6JA2uLZoUWQwbhNQ@mail.gmail.com>
 <CAKf8g9TnJVqkjUQiCsmT4-M=g-xqV0osd4A2MiPOf1ZheOvyrQ@mail.gmail.com>
 <CALQtMBYbtvnMY10g2=k_ErT8S243RjN0L58GyQDOGNdunxG7ZA@mail.gmail.com>
 <CAE1aY-kVmOgA3W4pON=zUayb3=+b_GeYhYHYrzXvoGjAptKyAA@mail.gmail.com>
Message-ID: <CAKf8g9QJdAieXO3pOg-1fjRs0Wmr26dVSypzDzjaQWATuYFgjg@mail.gmail.com>

I've gotten sidetracked on arithmetic ops testing, can try to refocus on
the datetimelike EAs following the meeting.

On Wed, Sep 26, 2018 at 4:07 AM Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

> Agreed that all our internal extension arrays should be real EAs for the
> 0.24.0 release.
>
> My plan has been SparseArray, then PeriodArray, then DatetimeTZArray. I'll
> see what I can prototype before the meeting tomorrow
>
> Tom
>
> On Tue, Sep 25, 2018 at 12:41 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> 2018-08-14 17:32 GMT+02:00 Brock Mendel <jbrockmendel at gmail.com>:
>>
>>> `DatetimeArray` is close to ready if you want to bring it over the
>>> finish line.  Pretty much all that has to be done is having
>>> `DatetimeArrayMixin` subclass `ExtensionArray` (and, uh, implement the
>>> relevant EA methods).  If no one else picks this up, my current plan is to
>>> do this _after_ updating all of the relevant arithmetic tests to test
>>> DatetimeArrayMixin.
>>>
>> What's the status of this?  Asking because I think having a working EA
>> DatetimeArray implementation is important for a 0.24.0 release, and I can
>> imagine it will still take quite some discussion and would be good to have
>> it in master for a while.
>>
>> It's hard to really steer this since it is volunteer based (and
>> certainly because I currently don't have the time to do it myself), but to
>> the extent possible, it would be good if we could try to prioritize it a
>> bit.
>>
>> Joris
>>
>>
>>
>>> > The unclear part is what `Series[datetime_with_tz].values` should be.
>>>
>>> I thought the conclusion was that `.values` should be non-lossy, in
>>> which case it would have to be the EA.  My preference would be for the EA
>>> to be returned for non-tz datetime64[ns] Series too.
>>>
>>> For that matter, I'd like it if `Series.values` _always_ returned an EA,
>>> but we're not there yet.
>>>
>>>
>>> On Tue, Aug 14, 2018 at 4:13 AM, Tom Augspurger <
>>> tom.augspurger88 at gmail.com> wrote:
>>>
>>>> The discussion on datetime with timezone has been a bit scattered. I
>>>> don't think there's a single issue with everyone's thoughts.
>>>>
>>>> There will be a DatetimeWithTZ array that implements the EA interface.
>>>> Anywhere we're internally using a DatetimeIndex as a
>>>> container for datetimes with timezones will use the new EA.
>>>>
>>>> The unclear part is what `Series[datetime_with_tz].values` should be.
>>>> Currently, we convert to UTC, strip the timezone, and return
>>>> a datetime64[ns] ndarray. Changing that would be disruptive, jarringly
>>>> different from `Series[datetime].values` (no tz) and of little
>>>> value I think.
>>>>
>>>> Tom
>>>>
>>>> On Tue, Aug 14, 2018 at 4:07 AM Pietro Battiston <me at pietrobattiston.it>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I assumed that Datetime (with timezone, or maybe in general?) was also
>>>>> planned to follow the extension array interface, which is related to
>>>>> issue https://github.com/pandas-dev/pandas/issues/19041 , to the
>>>>> annoying fact that datetimeindexwithtz._values returns the index
>>>>> itself, and also to the fact that
>>>>> https://pandas.pydata.org/pandas-docs/stable/extending.html
>>>>> currently states "Pandas itself uses the extension system for some
>>>>> types that aren't built into NumPy (categorical, period, interval,
>>>>> datetime with timezone).", which is false.
>>>>>
>>>>> ... but I didn't find an issue for this? Did I miss it? Should I create
>>>>> it? Or was there a decision to leave datetimeindextz as it is, maybe
>>>>> for better compatibility with numpy?
>>>>>
>>>>> Pietro
>>>>> _______________________________________________
>>>>> Pandas-dev mailing list
>>>>> Pandas-dev at python.org
>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>
>>>>
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>>>
>>

From gfyoung17 at gmail.com  Wed Sep 26 16:09:30 2018
From: gfyoung17 at gmail.com (G Young)
Date: Wed, 26 Sep 2018 13:09:30 -0700
Subject: [Pandas-dev] Future Deprecation Policy
In-Reply-To: <CAE1aY-k2c7_ioxuaFrPZ1yV3v5_9i_ogE=BgAiYVMyJfPH3MAA@mail.gmail.com>
References: <CAE1aY-k2c7_ioxuaFrPZ1yV3v5_9i_ogE=BgAiYVMyJfPH3MAA@mail.gmail.com>
Message-ID: <CAJ1_J5h2E2TvdH4LW7BNUYY4OWfVkj4S7n_JD_tU2RzUEYXKPw@mail.gmail.com>

https://github.com/pandas-dev/pandas/issues/6581 is a good starting point.

To be honest, given how we operate currently with deprecations, I don't
really see too much of a difference between your two options (maybe it's
the wording?).  We generally have done something more akin to rolling, but
SemVer sounds like a more generic version of the latter.

On Wed, Sep 26, 2018 at 7:44 AM Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

> Hi all,
>
> At the sprint, we touched on this, but I don't recall there being a whole
> lot of discussion. I wanted to confirm that we're on the same page.
>
> Briefly, I see two options:
>
> 1. SemVer
>    Deprecations are introduced as needed. Enforcing a deprecation is a
>    backwards-incompatible change, and so is restricted to major releases
>    only.
>
> 2. Rolling deprecations
>    Deprecations are introduced as needed. Deprecations are enforced N
>    releases after they were introduced (N typically being 2-3 "major"
>    releases in practice).
>
> Do people have a preference between these two schemes? Does the fact that
> NumPy uses a rolling deprecation policy swing us one way or the other?
>
> I've added this to the agenda for the meeting tomorrow, but wanted to give
> people a chance to collect their thoughts first.
>
> Tom
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>

From jorisvandenbossche at gmail.com  Wed Sep 26 16:29:09 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 26 Sep 2018 22:29:09 +0200
Subject: [Pandas-dev] Future Deprecation Policy
In-Reply-To: <CAE1aY-k2c7_ioxuaFrPZ1yV3v5_9i_ogE=BgAiYVMyJfPH3MAA@mail.gmail.com>
References: <CAE1aY-k2c7_ioxuaFrPZ1yV3v5_9i_ogE=BgAiYVMyJfPH3MAA@mail.gmail.com>
Message-ID: <CALQtMBZdkTHTEYZxf_cEDGWzP6XodavWwG4G6RZzFBb-GJyA_w@mail.gmail.com>

What I remember is that we somewhat opted for option 1 (SemVer), but it is
true that for example numpy (and I think many of the other packages in the
pydata ecosystem) uses a more rolling deprecation cycle, and we didn't
really take this into account in the discussion.

The reason I remember option 1 is that I think we talked about
the promise of *"code that works in 1.0 should keep working (possibly with
deprecation warnings) in the 1.x cycle"*
(of course always with some exceptions where deprecations are impossible /
people were relying on buggy behaviour).
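
As an illustration of that promise, here is a minimal sketch of a deprecated keyword that keeps working while emitting a warning. The `load` function and its `sep` / `delimiter` keywords are hypothetical and only stand in for any public API; the use of `FutureWarning` is an assumption about which warning class such a policy would pick.

```python
import warnings


def load(path, *, sep=",", delimiter=None):
    """Hypothetical reader whose `delimiter` keyword is deprecated in favour of `sep`."""
    if delimiter is not None:
        # Old spelling still works during the 1.x cycle, but warns the caller.
        warnings.warn(
            "'delimiter' is deprecated and will be removed in a future "
            "major release; use 'sep' instead.",
            FutureWarning,
            stacklevel=2,
        )
        sep = delimiter
    return {"path": path, "sep": sep}


# Code written against the old keyword keeps working; the warning is the
# only visible change.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load("data.csv", delimiter=";")
```

Enforcing the deprecation would then mean deleting the `delimiter` branch, which is the step the two policies schedule differently.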

One example to look at is Django's policy:
https://docs.djangoproject.com/en/dev/internals/release-process/#internal-release-deprecation-policy,
which they describe themselves as a "loose form of semver".
This also corresponds more or less to the promise I quoted above I think.

Anyway, it is good to further discuss this and maybe take a more formal
decision about it.

(and also, whatever we decide, I think it would be good to have a page
describing our policy as django does)

Joris


2018-09-26 16:43 GMT+02:00 Tom Augspurger <tom.augspurger88 at gmail.com>:

> Hi all,
>
> At the sprint, we touched on this, but I don't recall there being a whole
> lot of discussion. I wanted to confirm that we're on the same page.
>
> Briefly, I see two options:
>
> 1. SemVer
>    Deprecations are introduced as needed. Enforcing a deprecation is a
>    backwards-incompatible change, and so is restricted to major releases
>    only.
>
> 2. Rolling deprecations
>    Deprecations are introduced as needed. Deprecations are enforced N
>    releases after they were introduced (N typically being 2-3 "major"
>    releases in practice).
>
> Do people have a preference between these two schemes? Does the fact that
> NumPy uses a rolling deprecation policy swing us one way or the other?
>
> I've added this to the agenda for the meeting tomorrow, but wanted to give
> people a chance to collect their thoughts first.
>
> Tom
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>
>

From vmehta94 at gmail.com  Thu Sep 27 06:49:04 2018
From: vmehta94 at gmail.com (Vinayak Mehta)
Date: Thu, 27 Sep 2018 16:19:04 +0530
Subject: [Pandas-dev] Python library to extract tables from PDF files
Message-ID: <CAC7wEY32b+u+xirUUCq9tJFk=P5KB4RcLufZEQgLsDubLNSBoA@mail.gmail.com>

Hello everyone!

I'm a software engineer based out of New Delhi, India. I've been a long-time
pandas user and have used it in countless projects and scripts! Thanks to the
core developers and contributors for working on it!

I recently released a Python library which lets users extract data tables
out of PDF files, my first open-source library! Here's the link:
https://github.com/socialcopsdev/camelot

It has a similar API to the pandas read_* functions, bearing most
similarity to read_html(). Like read_html(), it has a read_pdf() main
interface which returns a list of pandas DataFrames for each table found in
the PDF file, and contains two flavors for parsing different types of
tables!

I've created a comparison with other open-source PDF table extraction
libraries and tools in the wiki here
<https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
.

I would be really grateful if you could check it out and see if it's useful
to you, and give me any feedback that may help me improve it. I promise it
would take less than 5 minutes of your time! :)

To the core devs: I was wondering if pandas would be open to accept this
library as a contribution to its read_* interface? The library uses
OpenCV's morphological transformations to detect lines in PDFs when
flavor='lattice', which I could vendorize or re-implement. It also has two
system specific dependencies which are python-tk (used by matplotlib) and
ghostscript (used to convert PDF to PNG). The first one shouldn't pose a
problem since pandas also uses matplotlib, and for the second one, I could
look for a Python library alternative to ghostscript.

Looking forward to hearing from you all!

Thanks for your time!

Vinayak

From jorisvandenbossche at gmail.com  Thu Sep 27 10:43:58 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 27 Sep 2018 16:43:58 +0200
Subject: [Pandas-dev] Indexing API [was: Pandas Sprint Recap]
Message-ID: <CALQtMBbTAr1esrMhFiPN3inQJ85q4nUK+r7aeFjEowZKeSKPTQ@mail.gmail.com>

To come back to this old thread.

I think when Tom said "We didn't discuss indexing much, beyond agreeing
that there's work to be done, and that fixing it was too large a task for
1.0.", he mainly meant things like "fixing __getitem__" (
https://github.com/pandas-dev/pandas/issues/9595), or at least that is how
I understood our discussion.
And doing that is out of scope for 1.0 not only because it is too much
work, but IMO also because it would be too big an API disruption.

But I am opening this new thread, as the long discussion in the old thread
was triggered by the indexing API questions that Pietro brought up, but
then focused more on general pandas 1.0 vs 2.0 vs new code base questions,
while I think the initial question about indexing still deserves attention.

[Pietro]
>
> There are many bugs - in particular, in indexing code - which might
> potentially break existing code when fixed. Some of them will have non-
> trivial deprecation paths/detection strategies. The first ones that
> come to my mind are #18631
> <https://github.com/pandas-dev/pandas/issues/18631>, #12827
> <https://github.com/pandas-dev/pandas/issues/12827>, #9519
> <https://github.com/pandas-dev/pandas/issues/9519>. The last one, in
> particular,
> implies changing the result of potentially tons of calls to .loc on a
> non-unique index.
>

That's already a nice list of issues (only the last one might be more
controversial I think).
Do you think those could be solved without a complete refactor of the
indexing code? (because I think we can still try to fix some API warts for
1.0, a complete refactor might be less realistic)

If there are other issues, it would be good to try to compile a list of
indexing-related issues that we would still like to see fixed for 1.0.

Joris


2018-07-13 19:45 GMT+02:00 Tom Augspurger <tom.augspurger88 at gmail.com>:

> Thanks Pietro,
>
> We didn't discuss indexing much, beyond agreeing that there's work to be
> done, and that fixing it was too large
> a task for 1.0.
>
> As for whether an individual issue is a bug or feature, we'll have to
> continue using our judgement. I think we'll
> inevitably break users' code in a 1.x release as we fix bugs.
>
> We'll need to discuss workflows for these large changes (e.g. ripping out
> the block manager) that will be API
> breaking, but may take some time to land. Keeping a separate branch in
> sync is a pain, but may be the least
> painful alternative.
>
> One thing I want to reiterate: it's not going to take another 11 years to
> reach pandas 2.0 :) Just because we don't
> solve indexing for 1.0 doesn't mean we won't ever be able to fix it.
>
> Tom
>
> On Fri, Jul 13, 2018 at 12:12 PM, Pietro Battiston <me at pietrobattiston.it>
> wrote:
>
>> Hi Tom,
>>
>> first, thanks to all those who participated in the sprint, and for the
>> recap.
>>
>> Il giorno dom, 08/07/2018 alle 16.26 -0500, Tom Augspurger ha scritto:
>> > [...]
>> > I've posted a document on our wiki with a summary of the topics
>> > discussed.
>> > https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)
>> >
>> > If people have questions or comments, feel free to post here and
>> > we'll clarify that document.
>>
>> Something that scares me - but maybe because I'm missing something
>> obvious - is what exactly qualifies as "deprecation". Is it something
>> which was once presented as a distinct feature and is then disabled, or
>> any general change to what any API call performs (that is, anything
>> requiring a deprecation cycle)?
>>
>> There are many bugs - in particular, in indexing code - which might
>> potentially break existing code when fixed. Some of them will have non-
>> trivial deprecation paths/detection strategies. The first ones that
>> come to my mind are #18631, #12827, #9519. The last one, in particular,
>> implies changing the result of potentially tons of calls to .loc on a
>> non-unique index.
>>
>> My view is that those (and many more, including several that will be
>> found) will be best fixed through a total rewrite of indexing code
>> (i.e., all code in indexing.py, and some code in internals.py), which I
>> assumed would happen before 1.0, and which I certainly won't be able to
>> do before 0.24.0 (September 2018).
>> I'm clearly not claiming that nobody else can do it (nor that the bugs
>> can necessarily only be fixed through a complete rewrite)... but since
>> I did not get any feedback on
>> https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code
>> ... I assume that nobody is focusing/planning to focus on this in the
>> near future (or was it somehow discussed in the sprint?).
>>
>> I perfectly understand the desire to stop postponing 1.0 to a vague
>> future, if it's just a matter of recognizing that pandas is worth
>> using.
>> But if it's a statement/commitment about code robustness/quality, and
>> relatedly API stability... then I think it is risky to leave the
>> indexing API, and more in general the core codebase (as opposed to
>> important but more lateral features such as new dtypes) out of the
>> picture (e.g. out of #21894).
>>
>> Cheers,
>>
>> Pietro
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>
>

From wesmckinn at gmail.com  Thu Sep 27 13:28:14 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 27 Sep 2018 13:28:14 -0400
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27
 at 17:00 UTC
In-Reply-To: <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>
References: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
 <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>
Message-ID: <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>

I came late to the call, but it was full. Can someone with a company
that uses Google Meet host the next hangout? I think those are not
size limited.

On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche
<jorisvandenbossche at gmail.com> wrote:
>
> Correction to my previous mail: the title was correct about 17:00 UTC, but of course this corresponds then to 10:00 Pacific / 13:00 Eastern / 18:00 UTC+1 / 18:00 CEST (Europe).
>
> Joris
>
>
> 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche <jorisvandenbossche at gmail.com>:
>>
>> Hi all,
>>
>> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern / 14:00 UTC+1 / 15:00 CEST (Europe).
>> All are welcome to attend.
>>
>> Hangout: https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0
>>
>> Calendar invite: https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com
>>
>> Agenda/Minutes: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing
>>
>> Joris
>>
>>
>>
>>
>>
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From garcia.marc at gmail.com  Thu Sep 27 13:37:08 2018
From: garcia.marc at gmail.com (Marc Garcia)
Date: Thu, 27 Sep 2018 18:37:08 +0100
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27
 at 17:00 UTC
In-Reply-To: <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>
References: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
 <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>
 <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>
Message-ID: <CAEk5N5vYWmELK3b=Db1eZcfA6ocDdG1oGcJ_nZHs9C5wpKGj6Q@mail.gmail.com>

Hey Wes,

we're moving here: anaconda.webex.com/join/taugspurger

On Thu, Sep 27, 2018 at 6:29 PM Wes McKinney <wesmckinn at gmail.com> wrote:

> I came late to the call, but it was full. Can someone with a company
> that uses Google Meet host the next hangout? I think those are not
> size limited
> On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche
> <jorisvandenbossche at gmail.com> wrote:
> >
> > Correction to my previous mail: the title was correct about 17:00 UTC,
> but of course this corresponds then to 10:00 Pacific / 13:00 Eastern /
> 18:00 UTC+1 / 18:00 CEST (Europe).
> >
> > Joris
> >
> >
> > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche <
> jorisvandenbossche at gmail.com>:
> >>
> >> Hi all,
> >>
> >> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern
> / 14:00 UTC+1 / 15:00 CEST (Europe).
> >> All are welcome to attend.
> >>
> >> Hangout:
> https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0
> >>
> >> Calendar invite:
> https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com
> >>
> >> Agenda/Minutes:
> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing
> >>
> >> Joris
> >>
> >>
> >>
> >>
> >>
> >
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>

From jeremy.r.schendel at gmail.com  Thu Sep 27 13:32:56 2018
From: jeremy.r.schendel at gmail.com (Jeremy Schendel)
Date: Thu, 27 Sep 2018 11:32:56 -0600
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27
 at 17:00 UTC
In-Reply-To: <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>
References: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
 <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>
 <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>
Message-ID: <CAO4FijiPQjcAoBLDaGyZSxjuFReVYU+GoMZekBaf7DDdHq84Sw@mail.gmail.com>

+1 I'm having the same issue.

On Thu, Sep 27, 2018, 11:29 AM Wes McKinney <wesmckinn at gmail.com> wrote:

> I came late to the call, but it was full. Can someone with a company
> that uses Google Meet host the next hangout? I think those are not
> size limited
> On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche
> <jorisvandenbossche at gmail.com> wrote:
> >
> > Correction to my previous mail: the title was correct about 17:00 UTC,
> but of course this corresponds then to 10:00 Pacific / 13:00 Eastern /
> 18:00 UTC+1 / 18:00 CEST (Europe).
> >
> > Joris
> >
> >
> > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche <
> jorisvandenbossche at gmail.com>:
> >>
> >> Hi all,
> >>
> >> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern
> / 14:00 UTC+1 / 15:00 CEST (Europe).
> >> All are welcome to attend.
> >>
> >> Hangout:
> https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0
> >>
> >> Calendar invite:
> https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com
> >>
> >> Agenda/Minutes:
> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing
> >>
> >> Joris
> >>
> >>
> >>
> >>
> >>
> >
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>

From jorisvandenbossche at gmail.com  Thu Sep 27 15:33:57 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 27 Sep 2018 21:33:57 +0200
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27
 at 17:00 UTC
In-Reply-To: <CAEk5N5vYWmELK3b=Db1eZcfA6ocDdG1oGcJ_nZHs9C5wpKGj6Q@mail.gmail.com>
References: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
 <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>
 <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>
 <CAEk5N5vYWmELK3b=Db1eZcfA6ocDdG1oGcJ_nZHs9C5wpKGj6Q@mail.gmail.com>
Message-ID: <CALQtMBZwG34eEU2JHvs_=bg-zDDCkPijzHJdNQ5-MoUZxNFKqA@mail.gmail.com>

Sorry for the trouble. Annoying now, but it is also a good sign that so many
of us joined. In any case, we need to think about this in advance next
time!

Notes are still in the same document:
https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit#

2018-09-27 19:37 GMT+02:00 Marc Garcia <garcia.marc at gmail.com>:

> Hey Wes,
>
> we're moving here: anaconda.webex.com/join/taugspurger
>
> On Thu, Sep 27, 2018 at 6:29 PM Wes McKinney <wesmckinn at gmail.com> wrote:
>
>> I came late to the call, but it was full. Can someone with a company
>> that uses Google Meet host the next hangout? I think those are not
>> size limited
>> On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche
>> <jorisvandenbossche at gmail.com> wrote:
>> >
>> > Correction to my previous mail: the title was correct about 17:00 UTC,
>> but of course this corresponds then to 10:00 Pacific / 13:00 Eastern /
>> 18:00 UTC+1 / 18:00 CEST (Europe).
>> >
>> > Joris
>> >
>> >
>> > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche <
>> jorisvandenbossche at gmail.com>:
>> >>
>> >> Hi all,
>> >>
>> >> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern
>> / 14:00 UTC+1 / 15:00 CEST (Europe).
>> >> All are welcome to attend.
>> >>
>> >> Hangout: https://hangouts.google.com/hangouts/_/calendar/
>> am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?
>> ijlm=1537828013406&authuser=0
>> >>
>> >> Calendar invite: https://calendar.google.com/
>> event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bj
>> A0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com
>> >>
>> >> Agenda/Minutes: https://docs.google.com/document/d/
>> 1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing
>> >>
>> >> Joris
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>> > _______________________________________________
>> > Pandas-dev mailing list
>> > Pandas-dev at python.org
>> > https://mail.python.org/mailman/listinfo/pandas-dev
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>

From jbrockmendel at gmail.com  Thu Sep 27 18:14:42 2018
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Thu, 27 Sep 2018 15:14:42 -0700
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27
 at 17:00 UTC
In-Reply-To: <CALQtMBZwG34eEU2JHvs_=bg-zDDCkPijzHJdNQ5-MoUZxNFKqA@mail.gmail.com>
References: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
 <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>
 <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>
 <CAEk5N5vYWmELK3b=Db1eZcfA6ocDdG1oGcJ_nZHs9C5wpKGj6Q@mail.gmail.com>
 <CALQtMBZwG34eEU2JHvs_=bg-zDDCkPijzHJdNQ5-MoUZxNFKqA@mail.gmail.com>
Message-ID: <CAKf8g9TdOkXYjVh_A_F++v4ZqnaRaAxcLsN2pUjJ02AMhx2rYg@mail.gmail.com>

I've written up some thoughts on the discussion (and things I couldn't
communicate because of audio trouble).

I) DatetimeArray/TimedeltaArray/PeriodArray Status
    The constructors are unfinished.  The DatetimeIndex constructors have
    comments suggesting they be simplified.  When writing the Array
    constructors I only ported the parts I thought non-controversial.

    The tests are near nonexistent.  The game-plan has been to get the
    arithmetic tests finished, then add DatetimeArray etc to the
    parameterizations.  This has been slowed down by the fact that there
    are more arithmetic inconsistencies in DataFrame than expected.


II) ExtensionArray
    A) Allow 2D?
        i) AFAICT none of the currently-implemented EA code actually depends
           on the 1D restriction
        ii) A bunch of fragile Block/BlockManager/DataFrame code has to do
            gymnastics to deal with 1D-only cases.  Allowing reshape
            to (N, 1) would make a lot of that unnecessary.
        iii) If we intend for EA to be useful in the broader ecosystem
             (e.g. xarray), it needs to be pretty much a drop-in replacement
             for ndarray.
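
To make point (ii) concrete, here is a minimal sketch that uses a plain NumPy array as a stand-in for an ExtensionArray. The variable names are illustrative only, and none of this is actual pandas internals. It only shows that going from (N,) to (N, 1) and back is a cheap view operation on the same data.

```python
import numpy as np

# Stand-in for the data held by a 1D-only array (made-up example values).
values_1d = np.array([10, 20, 30])

# Reshape to (N, 1): N rows, one column. For a contiguous array this is a view.
values_2d = values_1d.reshape(-1, 1)

# Going back to 1D is equally cheap.
round_trip = values_2d.ravel()

print(values_1d.shape, values_2d.shape, round_trip.shape)
print(np.shares_memory(values_1d, values_2d))
```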

    B) Constructors and Composition vs Inheritance
        i) `Index` subclasses have `_simple_new` and `Index.__new__` can be
           used to dispatch to the appropriate Index subclass.

           Similarly, `Block` subclasses have `make_block` and
           `internals.blocks.make_block` can be used to dispatch to the
           appropriate `Block` subclass.

        ii) Consider the following:
            - Change `make_block` to follow `_simple_new` semantics/naming,
              have `Block.__new__` behave analogously to `Index.__new__`
            - Implement `_simple_new` on pandas' EA subclasses, with a
              similar `EArray.__new__` dispatch
            - Define something like

              @property
              def _base_constructor(self):
                  return (Index|Block|EArray)

            - De-duplicate a whole mess of code

        iii) This would pretty well lock us in to using inheritance
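
A rough sketch of the dispatch pattern described in (B), in plain Python. The classes `Base`, `IntLike` and `StrLike`, the `kind` keyword and the registry are all hypothetical stand-ins invented for illustration. This is not how `Index`, `Block` or any EA is implemented in pandas, only the general shape of "`__new__` on the base class dispatches, `_simple_new` does the actual construction".

```python
class Base:
    """Hypothetical stand-in for Index / Block / EArray."""

    # Registry mapping an element type to the subclass that handles it.
    _registry = {}

    def __init_subclass__(cls, kind=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if kind is not None:
            Base._registry[kind] = cls

    def __new__(cls, values):
        # Only the base class dispatches; subclasses construct themselves.
        if cls is Base:
            kind = type(values[0]) if len(values) else None
            target = Base._registry.get(kind, Base)
            return target._simple_new(values)
        return cls._simple_new(values)

    @classmethod
    def _simple_new(cls, values):
        # No validation or inference here: just attach the data.
        obj = object.__new__(cls)
        obj._values = list(values)
        return obj

    @property
    def _base_constructor(self):
        # Every subclass points back at the dispatching base class.
        return Base


class IntLike(Base, kind=int):
    pass


class StrLike(Base, kind=str):
    pass


print(type(Base([1, 2])).__name__)
print(type(Base(["a", "b"])).__name__)
```

Note that sharing `__new__` and `_simple_new` this way only works because everything inherits from the same base, which is the lock-in mentioned in (iii).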


On Thu, Sep 27, 2018 at 12:34 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Sorry for the troubles. Annoying now, but of course also a good sign that
> so many of us are joining. In any case, we need to think about it in
> advance next time!
>
> Notes are still in the same document:
> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit#
>
> 2018-09-27 19:37 GMT+02:00 Marc Garcia <garcia.marc at gmail.com>:
>
>> Hey Wes,
>>
>> we're moving here: anaconda.webex.com/join/taugspurger
>>
>> On Thu, Sep 27, 2018 at 6:29 PM Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>>> I came late to the call, but it was full. Can someone with a company
>>> that uses Google Meet host the next hangout? I think those are not
>>> size limited
>>> On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche
>>> <jorisvandenbossche at gmail.com> wrote:
>>> >
>>> > Correction to my previous mail: the title was correct about 17:00 UTC,
>>> but of course this corresponds then to 10:00 Pacific / 13:00 Eastern /
>>> 18:00 UTC+1 / 18:00 CEST (Europe).
>>> >
>>> > Joris
>>> >
>>> >
>>> > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com>:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> We're having a dev chat coming Thursday (September 27) at 9:00
>>> Eastern / 14:00 UTC+1 / 15:00 CEST (Europe).
>>> >> All are welcome to attend.
>>> >>
>>> >> Hangout:
>>> https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0
>>> >>
>>> >> Calendar invite:
>>> https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com
>>> >>
>>> >> Agenda/Minutes:
>>> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing
>>> >>
>>> >> Joris
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20180927/7a7e50c7/attachment.html>

From jorisvandenbossche at gmail.com  Fri Sep 28 04:31:43 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 28 Sep 2018 10:31:43 +0200
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27
 at 17:00 UTC
In-Reply-To: <CAKf8g9TdOkXYjVh_A_F++v4ZqnaRaAxcLsN2pUjJ02AMhx2rYg@mail.gmail.com>
References: <CALQtMBajx7VskREyVQG1P3WQcqu9eRzWmcBGjHjX9UHD+oboOA@mail.gmail.com>
 <CALQtMBaNoazv=me7P9j_JYck_wsyWOcsjMp7vHQmJ90eh7hMsQ@mail.gmail.com>
 <CAJPUwMDs02EFLwn+06EG+Q4+ygRcpOFnCxYSOOXTHti8JGymYA@mail.gmail.com>
 <CAEk5N5vYWmELK3b=Db1eZcfA6ocDdG1oGcJ_nZHs9C5wpKGj6Q@mail.gmail.com>
 <CALQtMBZwG34eEU2JHvs_=bg-zDDCkPijzHJdNQ5-MoUZxNFKqA@mail.gmail.com>
 <CAKf8g9TdOkXYjVh_A_F++v4ZqnaRaAxcLsN2pUjJ02AMhx2rYg@mail.gmail.com>
Message-ID: <CALQtMBYVL8by3C8jgr1FcLQ45hZbc3vBn6SxLD3cx2K0mP_oqw@mail.gmail.com>

Thanks for the notes!

2018-09-28 0:14 GMT+02:00 Brock Mendel <jbrockmendel at gmail.com>:

> I've written up some thoughts on the discussion (and things I couldn't
> communicate because of audio trouble)
>
> I) DatetimeArray/TimedeltaArray/PeriodArray Status
>     The constructors are unfinished.  The DatetimeIndex constructors have
>     comments suggesting they be simplified.  When writing the Array
>     constructors I only ported the parts I thought non-controversial.
>

Personally, I don't think we should copy over the constructors of the index
classes to the arrays. E.g. the DatetimeIndex constructor is indeed overly
complicated, trying to do partly what date_range and to_datetime already
do.
I would personally keep the Array constructors very simple (what we
typically call _simple_new), and have other constructor methods/functions
for those specific cases.
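
As a rough sketch of what I mean (the class and method names below are made up for illustration, this is not the actual DatetimeArray API): the main constructor only accepts already-parsed values, and parsing or range generation live in separate constructor methods.

```python
from datetime import datetime, timedelta


class SimpleDatetimeArray:
    """Hypothetical array class; not the pandas DatetimeArray."""

    def __init__(self, values):
        # Deliberately simple: accept already-parsed datetimes, nothing else.
        values = list(values)
        if not all(isinstance(v, datetime) for v in values):
            raise TypeError("values must be datetime objects")
        self._values = values

    @classmethod
    def from_strings(cls, strings, fmt="%Y-%m-%d"):
        # Parsing lives in its own constructor (the to_datetime-like case).
        return cls([datetime.strptime(s, fmt) for s in strings])

    @classmethod
    def from_range(cls, start, periods, step=timedelta(days=1)):
        # Range generation is separate too (the date_range-like case).
        return cls([start + i * step for i in range(periods)])

    def __len__(self):
        return len(self._values)


parsed = SimpleDatetimeArray.from_strings(["2018-09-27", "2018-09-28"])
generated = SimpleDatetimeArray.from_range(datetime(2018, 9, 27), periods=3)
print(len(parsed), len(generated))
```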


>
>     The tests are near nonexistent.  The game-plan has been to get the
>     arithmetic tests finished, then add DatetimeArray etc to the
>     parameterizations.  This has been slowed down by the fact that there
>     are more arithmetic inconsistencies in DataFrame than expected.
>
>
> II) ExtensionArray
>     A) Allow 2D?
>         i) AFAICT none of the currently-implemented EA code actually
>            depends on the 1D restriction
>         ii) A bunch of fragile Block/BlockManager/DataFrame code has to do
>             gymnastics to deal with 1D-only cases.  Allowing reshape
>             to (N, 1) would make a lot of that unnecessary.
>

It certainly creates complexity to have both 1D (non-consolidatable) and 2D
blocks. But personally, I find the 2D nature (which is also transposed
relative to what you would expect) very confusing to work with. A "one
column = one 1D array" model seems more attractive to me.


>         iii) If we intend for EA to be useful in the broader ecosystem
>              (e.g. xarray), it needs to be pretty much a drop-in
>              replacement for ndarray.
>

This is a good reason (and it would be interesting to hear from that
community), and we might need to be careful here (as we already have places
where we deviate from numpy semantics in EA).
But even then, when an EA supports 2D, we could still store it in a
DataFrame as a 1D array.


>
>     B) Constructors and Composition vs Inheritance
>         i) `Index` subclasses have `_simple_new` and `Index.__new__` can be
>            used to dispatch to the appropriate Index subclass.
>
>            Similarly, `Block` subclasses have `make_block` and
>            `internals.blocks.make_block` can be used to dispatch to the
>            appropriate `Block` subclass.
>

This is only the case for the base Index class, no? (I.e. that it can return
any kind of Index subclass.) So I don't think this aspect necessarily needs
to influence how the actual subclasses are created.


>
>         ii) Consider the following:
>             - Change `make_block` to follow `_simple_new` semantics/naming,
>               have `Block.__new__` behave analogously to `Index.__new__`
>             - Implement `_simple_new` on pandas' EA subclasses, with a
>               similar `EArray.__new__` dispatch
>             - Define something like
>
>               @property
>               def _base_constructor(self):
>                   return (Index|Block|EArray)
>
>             - De-duplicate a whole mess of code
>
>         iii) This would pretty well lock us in to using inheritance
>

This might de-duplicate some code, but IMO at the cost of increased
complexity. Having both Index and Array, which are two different objects
with different semantics, share the actual implementation (instead of
sharing via composition) will make it more complex, I think.
Personally, I have the feeling that composition will give us a simpler
model to reason about. And dispatching to underlying EA methods can
introduce some code overhead, but that could be automated if needed.
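
To give an idea of what "automated" could look like, a small sketch in plain Python. `SimpleArray`, `SimpleIndex` and the `delegate` decorator are hypothetical names invented here, not existing pandas code. The index holds an array and the forwarding methods are generated instead of written by hand.

```python
class SimpleArray:
    """Hypothetical stand-in for an ExtensionArray."""

    def __init__(self, values):
        self._values = list(values)

    def unique(self):
        seen = []
        for v in self._values:
            if v not in seen:
                seen.append(v)
        return SimpleArray(seen)

    def isna(self):
        return [v is None for v in self._values]


def delegate(names):
    """Class decorator that forwards the named methods to ``self._data``."""
    def decorator(cls):
        for name in names:
            # Bind the current name as a default to avoid late binding.
            def method(self, *args, _name=name, **kwargs):
                return getattr(self._data, _name)(*args, **kwargs)
            method.__name__ = name
            setattr(cls, name, method)
        return cls
    return decorator


@delegate(["unique", "isna"])
class SimpleIndex:
    """Hypothetical index that *has* an array instead of *being* one."""

    def __init__(self, data, name=None):
        self._data = data
        self.name = name


idx = SimpleIndex(SimpleArray([1, None, 1]), name="x")
print(idx.isna())
print(idx.unique()._values)
```

In this sketch the forwarded `unique` returns the array type rather than an index; whether and how results get re-wrapped would be a design choice on the Index side.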

Joris


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20180928/49a2df55/attachment-0001.html>

From vmehta94 at gmail.com  Fri Sep 28 14:01:54 2018
From: vmehta94 at gmail.com (Vinayak Mehta)
Date: Fri, 28 Sep 2018 23:31:54 +0530
Subject: [Pandas-dev] Python library to extract tables from PDF files
In-Reply-To: <CAC7wEY32b+u+xirUUCq9tJFk=P5KB4RcLufZEQgLsDubLNSBoA@mail.gmail.com>
References: <CAC7wEY32b+u+xirUUCq9tJFk=P5KB4RcLufZEQgLsDubLNSBoA@mail.gmail.com>
Message-ID: <CAC7wEY2L+PfAE5dEUx9JO_8KOvYhVNhOvtYp0R_=1-Ae_RoD_Q@mail.gmail.com>

I've created a Jupyter notebook which shows an example of how Camelot makes
it easy to extract tables out of PDFs. In the example, I scrape a PDF from
this disease outbreaks data source[1] using requests, extract tables from
each page of the PDF and then concat those tables. Here's the gist!
https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873 :)

[1] http://idsp.nic.in/index4.php?lang=1&level=0&linkid=406&lid=3689

On Thu, Sep 27, 2018 at 4:19 PM Vinayak Mehta <vmehta94 at gmail.com> wrote:

> Hello everyone!
>
> I'm a software engineer based out of New Delhi, India. I've been a
> long-time pandas user and have used it in countless projects and scripts!
> Thanks to the core developers and contributors for working on it!
>
> I recently released a Python library which lets users extract data tables
> out of PDF files, my first open-source library! Here's the link:
> https://github.com/socialcopsdev/camelot
>
> It has a similar API to the pandas read_* functions, bearing most
> similarity to read_html(). Like read_html(), it has a read_pdf() main
> interface which returns a list of pandas DataFrames for each table found in
> the PDF file, and contains two flavors for parsing different types of
> tables!
>
> I've created a comparison with other open-source PDF table extraction
> libraries and tools in the wiki here
> <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools>
> .
>
> I would be really grateful if you could check it out and see if it's useful
> to you, and give me any feedback that may help me improve it. I promise it
> would take less than 5 minutes of your time! :)
>
> To the core devs: I was wondering if pandas would be open to accept this
> library as a contribution to its read_* interface? The library uses
> OpenCV's morphological transformations to detect lines in PDFs when
> flavor='lattice', which I could vendorize or re-implement. It also has two
> system specific dependencies which are python-tk (used by matplotlib) and
> ghostscript (used to convert PDF to PNG). The first one shouldn't pose a
> problem since pandas also uses matplotlib, and for the second one, I could
> look for a Python library alternative to ghostscript.
>
> Looking forward to hearing from you all!
>
> Thanks for your time!
>
> Vinayak
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20180928/64642303/attachment.html>

From hritikxx8 at gmail.com  Sat Sep 29 13:33:20 2018
From: hritikxx8 at gmail.com (Hritik Vijay)
Date: Sat, 29 Sep 2018 23:03:20 +0530
Subject: [Pandas-dev] Pandas loc function including upper limit
Message-ID: <CACncA5Ki9B_1vm5v38pk6Gh9j=ob1ORstYhWiof469KRZ8k1Ow@mail.gmail.com>

The `loc` indexer includes the upper limit of a slice, which is very
counterintuitive. Shouldn't it follow `iloc` and other indexing methods and
exclude the upper limit (at least for integer slices)?
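
A small example of the behaviour in question (the Series below is made up for illustration, with integer labels that happen to match the positions):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])

by_label = s.loc[1:3]      # label-based slice: both endpoints are included
by_position = s.iloc[1:3]  # position-based slice: the stop is excluded

print(by_label.tolist())     # [20, 30, 40]
print(by_position.tolist())  # [20, 30]
```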

-- 
Regards
Hritik Vijay
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pandas-dev/attachments/20180929/176c3338/attachment.html>