From David.Rashty at flagstar.com  Thu Sep 13 19:31:55 2018
From: David.Rashty at flagstar.com (David M Rashty)
Date: Thu, 13 Sep 2018 23:31:55 +0000
Subject: [Pandas-dev] pandas or new project
Message-ID: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>

Dear pandas team,

I am a long time Stata user and I started using pandas about a year ago in
order to build web applications using an in memory dataframe structure. As
a business user, I've found Stata to have a key advantage over pandas that
many others have also noted: much faster development time. Examples in
Stata:

drop myvar*   // drops all columns starting with myvar
keep myvar*   // drops all columns except those starting with myvar
reg z y x     // runs the regression z = a+bx+cy + error

In order to use pandas in a Stata-like fashion, I've had to monkey patch
large parts of the library e.g.,

df = df.sdrop('myvar*')   # same as above
df = df.skeep('myvar*')   # same as above
df = df.sreg('z y x')     # same as above
df = df.squery('a>80 & b.str.contains("hello") & c.isin([1,2,3])')   # df.query doesn't support str.contains and isin to my knowledge

I put an "s" in front of my methods to mean either "stata" or "sugar".

Additionally, I've built a system to:

a) Automatically load new DataFrame methods into memory (no additional
   imports required)
b) A caching system to make loading data blazing fast along with a much
   tighter syntax e.g., pd.read_stata('mydata.dta') (6 secs load time) vs
   use.mydata (0.001 secs load time after the first read from file)
c) A system of column "labels" and formats to prettify various reports
   e.g., df.sscatter('rate score') produces a scatter plot with labels
   "Interest Rate, %" and "Credit Score", respectively.
d) A reactive web app (using Flask/Redis) to quickly view the full
   DataFrame content in a browser:

[attached image: image001.jpg]

Basically, I've tried to eliminate any obvious advantages Stata has over
pandas.

I'm potentially interested in developing this project into something
bigger. Would you like me to share my work in the context of pandas or
should it be a completely separate project with a different scope?

Thanks,

David Rashty | Flagstar Bank | Whole Loan Trading | 248-312-6692 |
david.rashty at flagstar.com

This e-mail may contain data that is confidential, proprietary or
non-public personal information, as that term is defined in the
Gramm-Leach-Bliley Act (collectively, Confidential Information). The
Confidential Information is disclosed conditioned upon your agreement that
you will treat it confidentially and in accordance with applicable law,
ensure that such data isn't used or disclosed except for the limited
purpose for which it's being provided and will notify and cooperate with us
regarding any requested or unauthorized disclosure or use of any
Confidential Information.
By accepting and reviewing the Confidential information, you agree to
indemnify us against any losses or expenses, including attorney's fees that
we may incur as a result of any unauthorized use or disclosure of this data
due to your acts or omissions. If a party other than the intended recipient
receives this e-mail, he or she is requested to instantly notify us of the
erroneous delivery and return to us all data so delivered.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 179861 bytes
Desc: image001.jpg

From tom.augspurger88 at gmail.com  Thu Sep 13 21:40:34 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 13 Sep 2018 20:40:34 -0500
Subject: [Pandas-dev] pandas or new project
In-Reply-To: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
References: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
Message-ID:

With respect to your `sdrop` and `skeep`, that's the goal of
DataFrame.filter, though the name isn't the best so it'll maybe be
deprecated in favor of something better.

The rest sound interesting, but likely out of scope for pandas. If you
build an open source library then we'd be happy to include it in pandas'
ecosystem page:
http://pandas.pydata.org/pandas-docs/stable/ecosystem.html

Tom
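
For readers following along, here is a minimal sketch of what the DataFrame.filter route mentioned above looks like for the `keep myvar*` and `drop myvar*` cases. The frame and its column names are made up for illustration; only `DataFrame.filter` and `DataFrame.drop` are stock pandas calls.

```python
import pandas as pd

# Toy frame; the column names are invented for this illustration.
df = pd.DataFrame(
    {"myvar1": [1, 2], "myvar2": [3, 4], "z": [5, 6], "other": [7, 8]}
)

# Stata `keep myvar*`: keep only the columns whose names start with "myvar".
kept = df.filter(regex=r"^myvar")

# Stata `drop myvar*`: drop those same columns and keep everything else.
dropped = df.drop(columns=kept.columns)

print(list(kept.columns))     # ['myvar1', 'myvar2']
print(list(dropped.columns))  # ['z', 'other']
```

Note that filter only selects; the "drop" direction has to be spelled as a second step, which is part of the verbosity the original poster is describing.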

From wesmckinn at gmail.com  Thu Sep 13 21:55:56 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 13 Sep 2018 21:55:56 -0400
Subject: [Pandas-dev] pandas or new project
In-Reply-To:
References: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
Message-ID:

hi David,

There's nothing really wrong with injecting a bunch of custom methods
into the DataFrame.* namespace. If you wanted, you could release your
package as something like

import pandas_stata

and then the new methods would be available. This is pretty common in
large corporate environments that use pandas AFAICT.

You can also propose your changes in pull requests to pandas.

- Wes
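
A rough sketch of the "import a package and the methods appear" idea described above. Everything here is hypothetical: the module name `pandas_stata` comes from the message, while `skeep`, `sdrop` and the helper `_matching` are stand-ins written for this illustration and are not the original poster's code.

```python
# Contents of a hypothetical module, e.g. pandas_stata.py
import fnmatch

import pandas as pd


def _matching(df, pattern):
    """Return the column labels matching a shell-style pattern such as 'myvar*'."""
    return [c for c in df.columns if fnmatch.fnmatchcase(str(c), pattern)]


def skeep(self, pattern):
    """Keep only the columns matching `pattern`."""
    return self[_matching(self, pattern)]


def sdrop(self, pattern):
    """Drop the columns matching `pattern`."""
    return self.drop(columns=_matching(self, pattern))


# Attaching the functions to the class is the "injection" step: once this
# module has been imported, every DataFrame exposes .skeep and .sdrop.
pd.DataFrame.skeep = skeep
pd.DataFrame.sdrop = sdrop

df = pd.DataFrame({"myvar1": [1], "myvar2": [2], "z": [3]})
print(list(df.skeep("myvar*").columns))  # ['myvar1', 'myvar2']
print(list(df.sdrop("myvar*").columns))  # ['z']
```

pandas also provides `pandas.api.extensions.register_dataframe_accessor`, which puts such methods under a namespace (for example `df.stata.keep(...)`) rather than directly on DataFrame; that avoids name clashes at the cost of slightly longer calls.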

From David.Rashty at flagstar.com  Thu Sep 13 22:16:21 2018
From: David.Rashty at flagstar.com (David M Rashty)
Date: Fri, 14 Sep 2018 02:16:21 +0000
Subject: [Pandas-dev] [EXTERNAL] Re: pandas or new project
In-Reply-To:
References: <4bc4feca077b4b56bd14b9a7483c5b7f@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>
Message-ID: <032e7016a7df42e49f5cc40181f3f775@FSTROYMSMAIL04.CORP.FSROOT.FLAGSTAR.COM>

Got it. Thanks for the quick response!

Just for the record, I'm aware of filter; I just found it lacking. The
critical aspect of the example is the parameter 'myvar*', which is
equivalent to the regex '^myvar.*'. However, Stata has a special syntax to
quickly reference variables in a way that is more intuitive for non-nerds
and very effective for exploratory data analysis. For example,
skeep('san* los_ang california-michigan') means filter on:

- All columns that start with 'san', e.g., san_francisco, san_jose, etc.
- A unique column starting with los_ang (raises an error if the match is
  not unique or there isn't a match); note: this isn't something you'd
  want to do in application development, but it is very helpful in
  exploratory data analysis.
- All columns between california and michigan, e.g., if california is
  column 5 and michigan is column 10, it will keep columns
  5,6,7,8,9,10.

Imagine trying to get a business person to implement something like this
in pandas... hence people going back to tools like Stata, SAS, JMP, etc.,
which all have tools to bridge the gap between human thought and machine
execution.
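
To make the three rules above concrete, here is one possible way to expand such a variable list in plain Python. This is a sketch under stated assumptions, not the poster's implementation: `expand_varlist` and `skeep` are hypothetical names, column labels are assumed to be strings, and a token containing "-" is treated as a range unless it is itself an exact column name.

```python
import fnmatch

import pandas as pd


def expand_varlist(columns, spec):
    """Expand a Stata-like variable list into concrete column labels."""
    columns = list(columns)
    selected = []
    for token in spec.split():
        if "*" in token or "?" in token:
            # Wildcard: every column matching the shell-style pattern.
            matches = [c for c in columns if fnmatch.fnmatchcase(c, token)]
            if not matches:
                raise KeyError(f"no column matches {token!r}")
        elif token in columns:
            # Exact column name.
            matches = [token]
        elif "-" in token:
            # Range: every column positioned from `first` through `last`.
            first, last = token.split("-", 1)
            start, stop = columns.index(first), columns.index(last)
            matches = columns[start : stop + 1]
        else:
            # Abbreviation: the token must be a prefix of exactly one column.
            matches = [c for c in columns if c.startswith(token)]
            if len(matches) != 1:
                raise KeyError(
                    f"{token!r} matches {len(matches)} columns, expected 1"
                )
        # Preserve first-seen order and avoid duplicates.
        selected.extend(c for c in matches if c not in selected)
    return selected


def skeep(df, spec):
    """Keep only the columns named by the variable list `spec`."""
    return df[expand_varlist(df.columns, spec)]


cols = ["san_francisco", "san_jose", "los_angeles",
        "california", "florida", "michigan", "texas"]
df = pd.DataFrame([[0] * len(cols)], columns=cols)

print(list(skeep(df, "san* los_ang california-michigan").columns))
# ['san_francisco', 'san_jose', 'los_angeles', 'california', 'florida', 'michigan']
```

An ambiguous abbreviation such as "san" (two candidate columns) raises KeyError here, mirroring the "raises an error if not unique" rule.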

From jorisvandenbossche at gmail.com  Mon Sep 24 18:27:34 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 25 Sep 2018 00:27:34 +0200
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC
Message-ID:

Hi all,

We're having a dev chat coming Thursday (September 27) at 9:00 Eastern /
14:00 UTC+1 / 15:00 CEST (Europe).
All are welcome to attend.

Hangout:
https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0

Calendar invite:
https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com

Agenda/Minutes:
https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing

Joris

From jorisvandenbossche at gmail.com  Tue Sep 25 02:38:15 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 25 Sep 2018 08:38:15 +0200
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC
In-Reply-To:
References:
Message-ID:

Correction to my previous mail: the title was correct about 17:00 UTC, but
of course this corresponds then to 10:00 Pacific / 13:00 Eastern / 18:00
UTC+1 / 19:00 CEST (Europe).

Joris

From jorisvandenbossche at gmail.com  Tue Sep 25 13:41:00 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 25 Sep 2018 19:41:00 +0200
Subject: [Pandas-dev] Datetime (with timezone?) as extension array?
In-Reply-To:
References: <1534237179.2549.26.camel@pietrobattiston.it>
Message-ID:

2018-08-14 17:32 GMT+02:00 Brock Mendel:

> `DatetimeArray` is close to ready if you want to bring it over the finish
> line. Pretty much all that has to be done is having `DatetimeArrayMixin`
> subclass `ExtensionArray` (and, uh, implement the relevant EA methods).
> If no one else picks this up, my current plan is to do this _after_
> updating all of the relevant arithmetic tests to test DatetimeArrayMixin.

What's the status of this? Asking because I think having a working EA
DatetimeArray implementation is important for a 0.24.0 release, and I can
imagine it will still take quite some discussion and would be good to have
it in master for a while.

It's hard to really steer this since it is volunteer based (and
certainly because I currently don't have the time to do it myself), but to
the extent possible, it would be good if we could try to prioritize it a
bit.

Joris

> > The unclear part is what `Series[datetime_with_tz].values` should be.
>
> I thought the conclusion was that `.values` should be non-lossy, in which
> case it would have to be the EA. My preference would be for the EA to be
> returned for non-tz datetime64[ns] Series too.
>
> For that matter, I'd like it if `Series.values` _always_ returned an EA,
> but we're not there yet.
>
> On Tue, Aug 14, 2018 at 4:13 AM, Tom Augspurger <
> tom.augspurger88 at gmail.com> wrote:
>
>> The discussion on datetime with timezone has been a bit scattered. I
>> don't think there's a single issue with everyone's thoughts.
>>
>> There will be a DatetimeWithTZ array that implements the EA interface.
>> Anywhere we're internally using a DatetimeIndex as a
>> container for datetimes with timezones will use the new EA.
>>
>> The unclear part is what `Series[datetime_with_tz].values` should be.
>> Currently, we convert to UTC, strip the timezone, and return
>> a datetime64[ns] ndarray. Changing that would be disruptive, jarringly
>> different from `Series[datetime].values` (no tz) and of little
>> value I think.
>>
>> Tom
>>
>> On Tue, Aug 14, 2018 at 4:07 AM Pietro Battiston wrote:
>>
>>> Hi all,
>>>
>>> I assumed that Datetime (with timezone, or maybe in general?) was also
>>> planned to follow the extension array interface, which is related to
>>> issue https://github.com/pandas-dev/pandas/issues/19041 , to the
>>> annoying fact that datetimeindexwithtz._values returns the index
>>> itself, and also to the fact that
>>> https://pandas.pydata.org/pandas-docs/stable/extending.html
>>> currently states "Pandas itself uses the extension system for some
>>> types that aren't built into NumPy (categorical, period, interval,
>>> datetime with timezone).", which is false.
>>>
>>> ... but I didn't find an issue for this? Did I miss it? Should I create
>>> it? Or was there a decision to leave datetimeindextz as it is, maybe
>>> for better compatibility with numpy?
>>>
>>> Pietro

From tom.augspurger88 at gmail.com  Wed Sep 26 07:06:48 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Wed, 26 Sep 2018 06:06:48 -0500
Subject: [Pandas-dev] Datetime (with timezone?) as extension array?
In-Reply-To:
References: <1534237179.2549.26.camel@pietrobattiston.it>
Message-ID:

Agreed that all our internal extension arrays should be real EAs for the
0.24.0 release.

My plan has been SparseArray, then PeriodArray, then DatetimeTZArray.
I'll see what I can prototype before the meeting tomorrow.

Tom
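
The `.values` question debated in this thread can be seen in a few lines. The sketch below assumes the behaviour described in the thread (conversion to UTC, timezone dropped, plain NumPy array returned); results may differ on pandas versions other than the one you run it on. The fixed UTC-04:00 offset is chosen only so the example does not depend on a timezone database.

```python
from datetime import timedelta, timezone

import numpy as np
import pandas as pd

# A fixed UTC-04:00 offset keeps the example independent of any tz database.
tz = timezone(timedelta(hours=-4))
s = pd.Series(pd.date_range("2018-09-13 00:00", periods=2, freq="D", tz=tz))

vals = s.values

print(s.dtype)     # a timezone-aware pandas dtype
print(type(vals))  # a plain NumPy array, not a pandas container
print(vals.dtype)  # a datetime64 dtype that carries no timezone
print(vals[0])     # local midnight at UTC-04:00 appears as 04:00, i.e. in UTC

# The timezone is gone from the array: this is the "lossy" part.
print(s.dt.tz, "->", getattr(vals, "tz", None))
```

Returning an extension array instead would keep the timezone attached, which is the "non-lossy" alternative argued for in the quoted messages.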
URL: From tom.augspurger88 at gmail.com Wed Sep 26 10:43:50 2018 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Wed, 26 Sep 2018 09:43:50 -0500 Subject: [Pandas-dev] Future Deprecation Policy Message-ID: Hi all, At the sprint, we touched on this, but I don't recall there being a whole lot of discussion. I wanted to confirm that we're on the same page. Briefly, I see two options: 1. SemVer Deprecations are introduced as needed. Enforcing a deprecation is a backwards-incompatible change, and so is restricted to major releases only. 2. Rolling deprecations Deprecations are introduced as needed. Deprecations are enforced N releases after they were introduce (N typically being 2-3 "major" release in practice). Do people have a preference between these two schemes? Does the fact that NumPy uses a rolling deprecation policy swing us one way or the other? I've added this to the agenda for the meeting tomorrow, but wanted to give people a chance to collect their thoughts first. Tom -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Wed Sep 26 16:07:35 2018 From: jbrockmendel at gmail.com (Brock Mendel) Date: Wed, 26 Sep 2018 13:07:35 -0700 Subject: [Pandas-dev] Datetime (with timezone?) as extension array? In-Reply-To: References: <1534237179.2549.26.camel@pietrobattiston.it> Message-ID: I've gotten sidetracked on arithmetic ops testing, can try to refocus on the datetimelike EAs following the meeting. On Wed, Sep 26, 2018 at 4:07 AM Tom Augspurger wrote: > Agreed that all our internal extension arrays should be real EAs for the > 0.24.0 release. > > My plan has been SparseArray, then PeriodArray, then DatetimeTZArray. 
I'll > see what I can prototype before the meeting tomorrow > > Tom > > On Tue, Sep 25, 2018 at 12:41 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> 2018-08-14 17:32 GMT+02:00 Brock Mendel : >> >>> `DatetimeArray` is close to ready if you want to bring it over the >>> finish line. Pretty much all that has to be done is having >>> `DatetimeArrayMixin` subclass `ExtensionArray` (and, uh, implement the >>> relevant EA methods). If no one else picks this up, my current plan is to >>> do this _after_ updating all of the relevant arithmetic tests to test >>> DatetimeArrayMixin. >>> >>> What's the status of this? Asking because I think having a working EA >> DatetimeArray implementation is important for a 0.24.0 release, and I can >> imagine it will still take quite some discussion and would be good to have >> it in master for a while. >> >> It's a hard to really steer this since it is volunteer based (and >> certainly because I currently don't have the time to do it myself), but to >> the extent possible, it would be good if we could try to prioritize it a >> bit. >> >> Joris >> >> >> >>> > The unclear part is what `Series[datetime_with_tz].values` should be. >>> >>> I thought the conclusion was that `.values` should be non-lossy, in >>> which case it would have to be the EA. My preference would be for the EA >>> to be returned for non-tz datetime64[ns] Series too. >>> >>> For that matter, I'd like it if `Series.values` _always_ returned an EA, >>> but we're not there yet. >>> >>> >>> On Tue, Aug 14, 2018 at 4:13 AM, Tom Augspurger < >>> tom.augspurger88 at gmail.com> wrote: >>> >>>> The discussion on datetime with timezone has been a bit scattered. I >>>> don't think there's a single issue with everyone's thoughts. >>>> >>>> There will be a DatetimeWithTZ array that implements the EA interface. >>>> Anywhere we're internally using a DatetimeIndex as a >>>> container for datetimes with timezones will use the new EA. 
>>>>
>>>> The unclear part is what `Series[datetime_with_tz].values` should be.
>>>> Currently, we convert to UTC, strip the timezone, and return
>>>> a datetime64[ns] ndarray. Changing that would be disruptive, jarringly
>>>> different from `Series[datetime].values` (no tz) and of little
>>>> value I think.
>>>>
>>>> Tom
>>>>
>>>> On Tue, Aug 14, 2018 at 4:07 AM Pietro Battiston
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I assumed that Datetime (with timezone, or maybe in general?) was also
>>>>> planned to follow the extension array interface, which is related to
>>>>> issue https://github.com/pandas-dev/pandas/issues/19041 , to the
>>>>> annoying fact that datetimeindexwithtz._values returns the index
>>>>> itself, and also to the fact that
>>>>> https://pandas.pydata.org/pandas-docs/stable/extending.html
>>>>> currently states "Pandas itself uses the extension system for some
>>>>> types that aren't built into NumPy (categorical, period, interval,
>>>>> datetime with timezone).", which is false.
>>>>>
>>>>> ... but I didn't find an issue for this? Did I miss it? Should I create
>>>>> it? Or was there a decision to leave datetimeindextz as it is, maybe
>>>>> for better compatibility with numpy?
>>>>>
>>>>> Pietro
>>>>> _______________________________________________
>>>>> Pandas-dev mailing list
>>>>> Pandas-dev at python.org
>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>
>>>>
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
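
To make the `.values` question in this thread concrete, here is a minimal sketch of the behaviour Tom describes (convert to UTC, strip the timezone, return a datetime64 ndarray). It assumes a pandas version in which `Series.values` still behaves that way for tz-aware data; the sample timestamp and the fixed UTC-05:00 offset are illustrative choices, not taken from the thread.

```python
from datetime import timedelta, timezone

import numpy as np
import pandas as pd

# Illustrative data: one timestamp with a fixed UTC-05:00 offset.
tz = timezone(timedelta(hours=-5))
ser = pd.Series([pd.Timestamp("2018-09-26 09:43:50", tz=tz)])

# Assumption: .values hands back a plain NumPy array for tz-aware data.
vals = ser.values

print(ser.dtype)   # a tz-aware datetime64 dtype
print(vals.dtype)  # a tz-naive datetime64 dtype
print(vals[0])     # the same instant, expressed in UTC

# The wall-clock time 09:43:50 at UTC-05:00 is 14:43:50 in UTC.
same_instant = vals[0] == np.datetime64("2018-09-26T14:43:50")
```

The timezone is not recoverable from `vals` alone, which is the "lossy" aspect Brock objects to.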
URL: From gfyoung17 at gmail.com Wed Sep 26 16:09:30 2018 From: gfyoung17 at gmail.com (G Young) Date: Wed, 26 Sep 2018 13:09:30 -0700 Subject: [Pandas-dev] Future Deprecation Policy In-Reply-To: References: Message-ID: https://github.com/pandas-dev/pandas/issues/6581 is a good starting point. To be honest, given how we operate currently with deprecations, I don't really see too much of a difference between your two options (maybe it's the wording?). We generally have done something more akin to rolling, but SemVer sounds like a more generic version of the latter. On Wed, Sep 26, 2018 at 7:44 AM Tom Augspurger wrote: > Hi all, > > At the sprint, we touched on this, but I don't recall there being a whole > lot of > discussion. I wanted to confirm that we're on the same page. > > Briefly, I see two options: > > 1. SemVer > Deprecations are introduced as needed. Enforcing a deprecation is a > backwards-incompatible change, and so is restricted to major releases > only. > > 2. Rolling deprecations > Deprecations are introduced as needed. Deprecations are enforced N > releases > after they were introduce (N typically being 2-3 "major" release in > practice). > > Do people have a preference between these two schemes? Does the fact that > NumPy > uses a rolling deprecation policy swing us one way or the other? > > I've added this to the agenda for the meeting tomorrow, but wanted to give > people a chance to collect their thoughts first. > > Tom > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
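
The two schemes Tom lists differ only in when a deprecation may be enforced, not in how it is introduced. A minimal sketch using only the standard library follows. `old_name`, `new_name`, the `RELEASES` list and both `may_enforce_*` helpers are hypothetical names invented for illustration, not pandas API, and a "major" bump is simplified to a change in the first version component.

```python
import warnings

# Hypothetical, ordered release history used only for this illustration.
RELEASES = ["0.23", "0.24", "0.25", "1.0", "1.1", "2.0"]


def new_name(x):
    return x * 2


def old_name(x):
    """Deprecated alias: still works, but warns (the 'introduce' step)."""
    warnings.warn(
        "old_name is deprecated, use new_name instead",
        FutureWarning,
        stacklevel=2,
    )
    return new_name(x)


def may_enforce_semver(introduced, current):
    """SemVer: removal is allowed only after a major-version bump."""
    return int(current.split(".")[0]) > int(introduced.split(".")[0])


def may_enforce_rolling(introduced, current, n=2):
    """Rolling: removal is allowed n releases after the warning appeared."""
    return RELEASES.index(current) - RELEASES.index(introduced) >= n


print(may_enforce_semver("1.0", "1.1"))     # False
print(may_enforce_semver("1.1", "2.0"))     # True
print(may_enforce_rolling("0.23", "0.25"))  # True
print(may_enforce_rolling("1.0", "1.1"))    # False
```

Under either scheme the warning in `old_name` is the same; only the rule for deleting `old_name` changes.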
URL:
From jorisvandenbossche at gmail.com Wed Sep 26 16:29:09 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 26 Sep 2018 22:29:09 +0200
Subject: [Pandas-dev] Future Deprecation Policy
In-Reply-To:
References:
Message-ID:

What I remember is that we somewhat opted for option 1 (SemVer), but it is
true that, for example, numpy (and I think many of the other packages in the
pydata ecosystem) uses a more rolling deprecation cycle, and we didn't
really take this into account in the discussion.
The reason I remember option 1 is that I think we talked about the
promise of *"code that works in 1.0 should keep working (possibly with
deprecation warnings) in the 1.x cycle"* (of course always with some
exceptions where deprecations are impossible / people were relying on buggy
behaviour).

One example to look at is Django's policy:
https://docs.djangoproject.com/en/dev/internals/release-process/#internal-release-deprecation-policy,
which they describe themselves as a "loose form of semver". This also
corresponds more or less to the promise I quoted above, I think.

Anyway, it is good to further discuss this and maybe take a more formal
decision about it (and also, whatever we decide, I think it would be good to
have a page describing our policy as Django does).

Joris

2018-09-26 16:43 GMT+02:00 Tom Augspurger :

> Hi all,
>
> At the sprint, we touched on this, but I don't recall there being a whole
> lot of
> discussion. I wanted to confirm that we're on the same page.
>
> Briefly, I see two options:
>
> 1. SemVer
> Deprecations are introduced as needed. Enforcing a deprecation is a
> backwards-incompatible change, and so is restricted to major releases
> only.
>
> 2. Rolling deprecations
> Deprecations are introduced as needed. Deprecations are enforced N
> releases
> after they were introduced (N typically being 2-3 "major" releases in
> practice).
>
> Do people have a preference between these two schemes?
> Does the fact that NumPy
> uses a rolling deprecation policy swing us one way or the other?
>
> I've added this to the agenda for the meeting tomorrow, but wanted to give
> people a chance to collect their thoughts first.
>
> Tom
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>
>

From vmehta94 at gmail.com Thu Sep 27 06:49:04 2018
From: vmehta94 at gmail.com (Vinayak Mehta)
Date: Thu, 27 Sep 2018 16:19:04 +0530
Subject: [Pandas-dev] Python library to extract tables from PDF files
Message-ID:

Hello everyone!

I'm a software engineer based out of New Delhi, India. I've been a long-time
pandas user and have used it in countless projects and scripts! Thanks to the
core developers and contributors for working on it!

I recently released a Python library which lets users extract data tables
out of PDF files, my first open-source library! Here's the link:
https://github.com/socialcopsdev/camelot

It has a similar API to the pandas read_* functions, bearing most
similarity to read_html(). Like read_html(), it has a read_pdf() main
interface which returns a list of pandas DataFrames for each table found in
the PDF file, and contains two flavors for parsing different types of
tables!

I've created a comparison with other open-source PDF table extraction
libraries and tools in the project wiki.

I would be really grateful if you could check it out and see if it's useful
to you, and give me any feedback that may help me improve it. I promise it
would take less than 5 minutes of your time! :)

To the core devs: I was wondering if pandas would be open to accept this
library as a contribution to its read_* interface? The library uses
OpenCV's morphological transformations to detect lines in PDFs when
flavor='lattice', which I could vendorize or re-implement.
It also has two system specific dependencies which are python-tk (used by
matplotlib) and ghostscript (used to convert PDF to PNG). The first one
shouldn't pose a problem since pandas also uses matplotlib, and for the
second one, I could look for a Python library alternative to ghostscript.

Looking forward to hearing from you all!

Thanks for your time!

Vinayak

From jorisvandenbossche at gmail.com Thu Sep 27 10:43:58 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 27 Sep 2018 16:43:58 +0200
Subject: [Pandas-dev] Indexing API [was: Pandas Sprint Recap]
Message-ID:

To come back to this old thread.

I think when Tom said "We didn't discuss indexing much, beyond agreeing
that there's work to be done, and that fixing it was too large a task for
1.0.", he mainly meant things like "fixing __getitem__" (
https://github.com/pandas-dev/pandas/issues/9595), or at least that is how
I understood our discussion. And doing that is not only out of scope for 1.0
because it is too much work, but IMO also because it would be too big an API
disruption.

But I am opening this new thread because the long discussion in the old
thread was triggered by the indexing API questions that Pietro brought up,
but then focused more on general pandas 1.0 vs 2.0 vs new code base
questions, while I think the initial question about indexing still deserves
attention.

[Pietro]

> There are many bugs - in particular, in indexing code - which might
> potentially break existing code when fixed. Some of them will have
> non-trivial deprecation paths/detection strategies. The first ones that
> come to my mind are #18631, #12827, #9519. The last one, in particular,
> implies changing the result of potentially tons of calls to .loc on a
> non-unique index.
>

That's already a nice list of issues (only the last one might be more
controversial, I think).
Do you think those could be solved without a complete refactor of the indexing code? (because I think we can still try to fix some API warts for 1.0, a complete refactor might be less realistic) If there are other issues, it would be good to try to compile a list of indexing-related issues that we would still like to see fixed for 1.0. Joris 2018-07-13 19:45 GMT+02:00 Tom Augspurger : > Thanks Pietro, > > We didn't discuss indexing much, beyond agreeing that there's work to be > done, and that fixing it was too large > a task for 1.0. > > As for whether an individual issue is a bug or feature, we'll have to > continue using our judgement. I think we'll > inevitably break users' code in a 1.x release as we fix bugs. > > We'll need to discuss workflows for these large changes (e.g. ripping out > the block manager) that will be API > breaking, but may take some time to land. Keeping a separate branch in > sync is a pain, but may be the least > painful alternative. > > One thing I want to reiterate: it's not going to take another 11 years to > reach pandas 2.0 :) Just because we don't > solve indexing for 1.0 doesn't mean we won't ever be able to fix it. > > Tom > > On Fri, Jul 13, 2018 at 12:12 PM, Pietro Battiston > wrote: > >> Hi Tom, >> >> first, thanks to all those who participated in the sprint, and for the >> recap. >> >> Il giorno dom, 08/07/2018 alle 16.26 -0500, Tom Augspurger ha scritto: >> > [...] >> > I've posted a document on our wiki with a summary of the topics >> > discussed. https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(J >> > uly,-2018) >> > >> > If people have questions or comments, feel free to post here and >> > we'll clarify that document. >> >> Something that scares me - but maybe because I'm missing something >> obvious - is what exactly qualifies as "deprecation". 
Is it something >> which was once presented as a distinct feature and is then disabled, or >> any general change to what any API call performs (that is, anything >> requiring a deprecation cycle - that is)? >> >> There are many bugs - in particular, in indexing code - which might >> potentially break existing code when fixed. Some of them will have non- >> trivial deprecation paths/detection strategies. The first ones that >> come to my mind are #18631, #12827, #9519. The last one, in particular, >> implies changing the result of potentially tons of calls to .loc on a >> non-unique index. >> >> My view is that those (and many more, including several that will be >> found) will be best fixed through a total rewrite of indexing code >> (i.e., all code in indexing.py, and some code in internals.py), which I >> assumed would happen before 1.0, and which I certainly won't be able to >> do before 0.24.0 (September 2018). >> I'm clearly not claiming that nobody else can do it (nor that the bugs >> can necessarily only be fixed through a complete rewrite)... but since >> I did not get any feedback on >> https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restruc >> turing-indexing-code >> >> ... I assume that nobody is focusing/planning to focus on this in the >> near future (or was it somehow discussed in the sprint?). >> >> I perfectly understand the desire to stop postponing 1.0 to a vague >> future, if it's just a matter of recognizing that pandas is worth >> using. >> But if it's a statement/commitment about code robustness/quality, and >> relatedly API stability... then I think we it is risky to leave the >> indexing API, and more in general the core codebase (as opposed to >> important but more lateral features such as new dtypes) out of the >> picture (e.g. out of #21894). 
>> >> Cheers, >> >> Pietro >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Thu Sep 27 13:28:14 2018 From: wesmckinn at gmail.com (Wes McKinney) Date: Thu, 27 Sep 2018 13:28:14 -0400 Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC In-Reply-To: References: Message-ID: I came late to the call, but it was full. Can someone with a company that uses Google Meet host the next hangout? I think those are not size limited On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche wrote: > > Correction to my previous mail: the title was correct about 17:00 UTC, but of course this corresponds then to 10:00 Pacific / 13:00 Eastern / 18:00 UTC+1 / 18:00 CEST (Europe). > > Joris > > > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche : >> >> Hi all, >> >> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern / 14:00 UTC+1 / 15:00 CEST (Europe). >> All are welcome to attend. 
>> >> Hangout: https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0 >> >> Calendar invite: https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com >> >> Agenda/Minutes: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing >> >> Joris >> >> >> >> >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev From garcia.marc at gmail.com Thu Sep 27 13:37:08 2018 From: garcia.marc at gmail.com (Marc Garcia) Date: Thu, 27 Sep 2018 18:37:08 +0100 Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC In-Reply-To: References: Message-ID: Hey Wes, we're moving here: anaconda.webex.com/join/taugspurger On Thu, Sep 27, 2018 at 6:29 PM Wes McKinney wrote: > I came late to the call, but it was full. Can someone with a company > that uses Google Meet host the next hangout? I think those are not > size limited > On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche > wrote: > > > > Correction to my previous mail: the title was correct about 17:00 UTC, > but of course this corresponds then to 10:00 Pacific / 13:00 Eastern / > 18:00 UTC+1 / 18:00 CEST (Europe). > > > > Joris > > > > > > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche < > jorisvandenbossche at gmail.com>: > >> > >> Hi all, > >> > >> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern > / 14:00 UTC+1 / 15:00 CEST (Europe). > >> All are welcome to attend. 
> >> > >> Hangout: > https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0 > >> > >> Calendar invite: > https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com > >> > >> Agenda/Minutes: > https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing > >> > >> Joris > >> > >> > >> > >> > >> > > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeremy.r.schendel at gmail.com Thu Sep 27 13:32:56 2018 From: jeremy.r.schendel at gmail.com (Jeremy Schendel) Date: Thu, 27 Sep 2018 11:32:56 -0600 Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC In-Reply-To: References: Message-ID: +1 I'm having the same issue. On Thu, Sep 27, 2018, 11:29 AM Wes McKinney wrote: > I came late to the call, but it was full. Can someone with a company > that uses Google Meet host the next hangout? I think those are not > size limited > On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche > wrote: > > > > Correction to my previous mail: the title was correct about 17:00 UTC, > but of course this corresponds then to 10:00 Pacific / 13:00 Eastern / > 18:00 UTC+1 / 18:00 CEST (Europe). > > > > Joris > > > > > > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche < > jorisvandenbossche at gmail.com>: > >> > >> Hi all, > >> > >> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern > / 14:00 UTC+1 / 15:00 CEST (Europe). > >> All are welcome to attend. 
> >> > >> Hangout: > https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0 > >> > >> Calendar invite: > https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com > >> > >> Agenda/Minutes: > https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing > >> > >> Joris > >> > >> > >> > >> > >> > > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu Sep 27 15:33:57 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 27 Sep 2018 21:33:57 +0200 Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC In-Reply-To: References: Message-ID: Sorry for the troubles. Annoying now, but somehow also a good sign that we are with many of course .. Any case, need to think about it in advance next time! Notes are still in the same document: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit# 2018-09-27 19:37 GMT+02:00 Marc Garcia : > Hey Wes, > > we're moving here: anaconda.webex.com/join/taugspurger > > On Thu, Sep 27, 2018 at 6:29 PM Wes McKinney wrote: > >> I came late to the call, but it was full. Can someone with a company >> that uses Google Meet host the next hangout? 
I think those are not >> size limited >> On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche >> wrote: >> > >> > Correction to my previous mail: the title was correct about 17:00 UTC, >> but of course this corresponds then to 10:00 Pacific / 13:00 Eastern / >> 18:00 UTC+1 / 18:00 CEST (Europe). >> > >> > Joris >> > >> > >> > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche < >> jorisvandenbossche at gmail.com>: >> >> >> >> Hi all, >> >> >> >> We're having a dev chat coming Thursday (September 27) at 9:00 Eastern >> / 14:00 UTC+1 / 15:00 CEST (Europe). >> >> All are welcome to attend. >> >> >> >> Hangout: https://hangouts.google.com/hangouts/_/calendar/ >> am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do? >> ijlm=1537828013406&authuser=0 >> >> >> >> Calendar invite: https://calendar.google.com/ >> event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bj >> A0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com >> >> >> >> Agenda/Minutes: https://docs.google.com/document/d/ >> 1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing >> >> >> >> Joris >> >> >> >> >> >> >> >> >> >> >> > >> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From jbrockmendel at gmail.com Thu Sep 27 18:14:42 2018
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Thu, 27 Sep 2018 15:14:42 -0700
Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC
In-Reply-To:
References:
Message-ID:

I've written up some thoughts on the discussion (and things I couldn't
communicate because of audio trouble)

I) DatetimeArray/TimedeltaArray/PeriodArray Status
    The constructors are unfinished. The DatetimeIndex constructors have
    comments suggesting they be simplified. When writing the Array
    constructors I only ported the parts I thought non-controversial.

    The tests are near nonexistent. The game-plan has been to get the
    arithmetic tests finished, then add DatetimeArray etc to the
    parameterizations. This has been slowed down by the fact that there
    are more arithmetic inconsistencies in DataFrame than expected.

II) ExtensionArray
    A) Allow 2D?
        i) AFAICT none of the currently-implemented EA code actually depends
           on the 1D restriction
        ii) A bunch of fragile Block/BlockManager/DataFrame code has to do
            gymnastics to deal with 1D-only cases. Allowing reshape
            to (N, 1) would make a lot of that unnecessary.
        iii) If we intend for EA to be useful in the broader ecosystem
             (e.g. xarray), it needs to be pretty much a drop-in replacement
             for ndarray.

    B) Constructors and Composition vs Inheritance
        i) `Index` subclasses have `_simple_new` and `Index.__new__` can be
           used to dispatch to the appropriate Index subclass.

           Similarly, `Block` subclasses have `make_block` and
           `internals.blocks.make_block` can be used to dispatch to the
           appropriate `Block` subclass.
        ii) Consider the following:
            - Change `make_block` to follow `_simple_new` semantics/naming,
              have `Block.__new__` behave analogously to `Index.__new__`
            - Implement `_simple_new` on pandas' EA subclasses, with a similar
              `EArray.__new__` dispatch
            - Define something like

                @property
                def _base_constructor(self):
                    return (Index|Block|EArray)

            - De-duplicate a whole mess of code

        iii) This would pretty well lock us in to using inheritance


On Thu, Sep 27, 2018 at 12:34 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Sorry for the troubles. Annoying now, but somehow also a good sign that we
> are with many of course .. Any case, need to think about it in advance next
> time!
>
> Notes are still in the same document:
> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit#
>
> 2018-09-27 19:37 GMT+02:00 Marc Garcia :
>
>> Hey Wes,
>>
>> we're moving here: anaconda.webex.com/join/taugspurger
>>
>> On Thu, Sep 27, 2018 at 6:29 PM Wes McKinney wrote:
>>
>>> I came late to the call, but it was full. Can someone with a company
>>> that uses Google Meet host the next hangout? I think those are not
>>> size limited
>>> On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche
>>> wrote:
>>> >
>>> > Correction to my previous mail: the title was correct about 17:00 UTC,
>>> but of course this corresponds then to 10:00 Pacific / 13:00 Eastern /
>>> 18:00 UTC+1 / 18:00 CEST (Europe).
>>> >
>>> > Joris
>>> >
>>> >
>>> > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com>:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> We're having a dev chat coming Thursday (September 27) at 9:00
>>> Eastern / 14:00 UTC+1 / 15:00 CEST (Europe).
>>> >> All are welcome to attend.
>>> >> >>> >> Hangout: >>> https://hangouts.google.com/hangouts/_/calendar/am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do?ijlm=1537828013406&authuser=0 >>> >> >>> >> Calendar invite: >>> https://calendar.google.com/event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bjA0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com >>> >> >>> >> Agenda/Minutes: >>> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing >>> >> >>> >> Joris >>> >> >>> >> >>> >> >>> >> >>> >> >>> > >>> > _______________________________________________ >>> > Pandas-dev mailing list >>> > Pandas-dev at python.org >>> > https://mail.python.org/mailman/listinfo/pandas-dev >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri Sep 28 04:31:43 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 28 Sep 2018 10:31:43 +0200 Subject: [Pandas-dev] Pandas development hangout - Thursday September 27 at 17:00 UTC In-Reply-To: References: Message-ID: Thanks for the notes! 2018-09-28 0:14 GMT+02:00 Brock Mendel : > I've written up some thoughts on the discussion (and things I couldn't > communicate because of audio trouble) > > I) DatetimeArray/TimedeltaArray/PeriodArray Status > The constructors are unfinished. The DatetimeIndex constructors have > comments suggesting they be simplified. When writing the Array > constructors I only ported the parts I thought non-controversial. > Personally, I don't think we should copy over the constructors of the index classes to the arrays. 
E.g., the DatetimeIndex constructor is indeed overly complicated, trying to do
partly what date_range and to_datetime already do. I would personally keep
the Array constructors very simple (what we typically called _simple_new),
and have other constructor methods/functions for those specific cases.

>
> The tests are near nonexistent. The game-plan has been to get the
> arithmetic tests finished, then add DatetimeArray etc to the
> parameterizations. This has been slowed down by the fact that there
> are more arithmetic inconsistencies in DataFrame than expected.
>
>
> II) ExtensionArray
>     A) Allow 2D?
>         i) AFAICT none of the currently-implemented EA code actually depends
>            on the 1D restriction
>         ii) A bunch of fragile Block/BlockManager/DataFrame code has to do
>             gymnastics to deal with 1D-only cases. Allowing reshape
>             to (N, 1) would make a lot of that unnecessary.
>

It certainly creates complexity to have both 1D (non-consolidatable) and 2D
blocks. But personally, I also find the 2D nature (which is, on top of that,
transposed relative to what you would expect) very confusing to work with.
A "one column = one 1D array" model seems more attractive to me.

>         iii) If we intend for EA to be useful in the broader ecosystem
>              (e.g. xarray), it needs to be pretty much a drop-in replacement
>              for ndarray.
>

This is a good reason (and it would be interesting to hear from that
community), and we might need to be careful here (as we already have places
where we deviate from numpy semantics in EA).
But even then, when an EA supports 2D, we could still store it in a
DataFrame as a 1D array.

>
>     B) Constructors and Composition vs Inheritance
>         i) `Index` subclasses have `_simple_new` and `Index.__new__` can be
>            used to dispatch to the appropriate Index subclass.
>
>            Similarly, `Block` subclasses have `make_block` and
>            `internals.blocks.make_block` can be used to dispatch to the
>            appropriate `Block` subclass.
>

This is only for the base Index class, no?
(that it can return any kind of Index subclass). So I don't think this aspect
necessarily needs to influence how the actual subclasses are created.

>
>         ii) Consider the following:
>             - Change `make_block` to follow `_simple_new` semantics/naming,
>               have `Block.__new__` behave analogously to `Index.__new__`
>             - Implement `_simple_new` on pandas' EA subclasses, with a similar
>               `EArray.__new__` dispatch
>             - Define something like
>
>                 @property
>                 def _base_constructor(self):
>                     return (Index|Block|EArray)
>
>             - De-duplicate a whole mess of code
>
>         iii) This would pretty well lock us in to using inheritance
>

This might de-duplicate some code, but IMO at the cost of increased
complexity. Having Index and Array, which are two different objects with
different semantics, share the actual implementation (instead of sharing via
composition) will make it more complex, I think.
Personally, I have the feeling that composition will give us a simpler
model to reason about. And dispatching to underlying EA methods can
introduce some code overhead, but that could be automated if needed.

Joris

>
>
> On Thu, Sep 27, 2018 at 12:34 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> Sorry for the troubles. Annoying now, but somehow also a good sign that
>> we are with many of course .. Any case, need to think about it in advance
>> next time!
>>
>> Notes are still in the same document:
>> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit#
>>
>> 2018-09-27 19:37 GMT+02:00 Marc Garcia :
>>
>>> Hey Wes,
>>>
>>> we're moving here: anaconda.webex.com/join/taugspurger
>>>
>>> On Thu, Sep 27, 2018 at 6:29 PM Wes McKinney
>>> wrote:
>>>
>>>> I came late to the call, but it was full. Can someone with a company
>>>> that uses Google Meet host the next hangout?
I think those are not >>>> size limited >>>> On Tue, Sep 25, 2018 at 2:38 AM Joris Van den Bossche >>>> wrote: >>>> > >>>> > Correction to my previous mail: the title was correct about 17:00 >>>> UTC, but of course this corresponds then to 10:00 Pacific / 13:00 Eastern / >>>> 18:00 UTC+1 / 18:00 CEST (Europe). >>>> > >>>> > Joris >>>> > >>>> > >>>> > 2018-09-25 0:27 GMT+02:00 Joris Van den Bossche < >>>> jorisvandenbossche at gmail.com>: >>>> >> >>>> >> Hi all, >>>> >> >>>> >> We're having a dev chat coming Thursday (September 27) at 9:00 >>>> Eastern / 14:00 UTC+1 / 15:00 CEST (Europe). >>>> >> All are welcome to attend. >>>> >> >>>> >> Hangout: https://hangouts.google.com/hangouts/_/calendar/ >>>> am9yaXN2YW5kZW5ib3NzY2hlQGdtYWlsLmNvbQ.4mvdhb4jukib8ei4nsi4vn04do? >>>> ijlm=1537828013406&authuser=0 >>>> >> >>>> >> Calendar invite: https://calendar.google.com/ >>>> event?action=TEMPLATE&tmeid=NG12ZGhiNGp1a2liOGVpNG5zaTR2bj >>>> A0ZG8gam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com >>>> >> >>>> >> Agenda/Minutes: https://docs.google.com/document/d/ >>>> 1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing >>>> >> >>>> >> Joris >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> > >>>> > _______________________________________________ >>>> > Pandas-dev mailing list >>>> > Pandas-dev at python.org >>>> > https://mail.python.org/mailman/listinfo/pandas-dev >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
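
The composition-versus-inheritance question in this exchange can be sketched in a few lines. The example below is a toy, not pandas code: `ToyArray`, `ToyIndex` and the `delegate` decorator are hypothetical names. It shows an index-like object that wraps an array (composition) and generates its forwarding methods automatically, which is one way the dispatch overhead Joris mentions could be automated.

```python
class ToyArray:
    """Stand-in for an extension array: owns the data and the logic."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def take(self, indices):
        return ToyArray(self._values[i] for i in indices)

    def unique(self):
        # dict.fromkeys keeps first-seen order while dropping duplicates.
        return ToyArray(dict.fromkeys(self._values))

    def tolist(self):
        return list(self._values)


def delegate(*names):
    """Class decorator: forward the named methods to ``self._data``."""

    def make_forwarder(name):
        def forwarder(self, *args, **kwargs):
            result = getattr(self._data, name)(*args, **kwargs)
            # Re-wrap array results so callers keep getting the wrapper type.
            if isinstance(result, ToyArray):
                return type(self)(result)
            return result

        forwarder.__name__ = name
        return forwarder

    def decorator(cls):
        for name in names:
            setattr(cls, name, make_forwarder(name))
        return cls

    return decorator


@delegate("take", "unique", "tolist", "__len__")
class ToyIndex:
    """Index-like wrapper: holds a ToyArray instead of inheriting from it."""

    def __init__(self, data):
        self._data = data if isinstance(data, ToyArray) else ToyArray(data)


idx = ToyIndex([3, 1, 3])
print(idx.take([0, 2]).tolist())  # [3, 3]
print(idx.unique().tolist())      # [3, 1]
print(len(idx))                   # 3
```

The inheritance alternative Brock outlines would instead make the index a subclass sharing constructors with the array; the sketch only illustrates the composition side of the trade-off.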
From vmehta94 at gmail.com Fri Sep 28 14:01:54 2018
From: vmehta94 at gmail.com (Vinayak Mehta)
Date: Fri, 28 Sep 2018 23:31:54 +0530
Subject: [Pandas-dev] Python library to extract tables from PDF files
In-Reply-To:
References:
Message-ID:

I've created a Jupyter notebook which shows an example of how Camelot makes
it easy to extract tables out of PDFs. In the example, I scrape a PDF from
this disease outbreaks data source[1] using requests, extract tables from
each page of the PDF and then concat those tables.

Here's the gist!
https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873 :)

[1] http://idsp.nic.in/index4.php?lang=1&level=0&linkid=406&lid=3689

On Thu, Sep 27, 2018 at 4:19 PM Vinayak Mehta wrote:

> Hello everyone!
>
> I'm a software engineer based out of New Delhi, India. I've been a
> long-time user and have used it in countless projects and scripts! Thanks
> to the core developers and contributors for working on it!
>
> I recently released a Python library which lets users extract data tables
> out of PDF files, my first open-source library! Here's the link:
> https://github.com/socialcopsdev/camelot
>
> It has a similar API to the pandas read_* functions, bearing most
> similarity to read_html(). Like read_html(), it has a read_pdf() main
> interface which returns a list of pandas DataFrames for each table found
> in the PDF file, and contains two flavors for parsing different types of
> tables!
>
> I've created a comparison with other open-source PDF table extraction
> libraries and tools in the wiki here.
>
> I would be really grateful if you could check it out and see if it's
> useful to you, and give me any feedback that may help me improve it. I
> promise it would take less than 5 minutes of your time! :)
>
> To the core devs: I was wondering if pandas would be open to accepting
> this library as a contribution to its read_* interface? The library uses
> OpenCV's morphological transformations to detect lines in PDFs when
> flavor='lattice', which I could vendorize or re-implement. It also has two
> system-specific dependencies, which are python-tk (used by matplotlib) and
> ghostscript (used to convert PDF to PNG). The first one shouldn't pose a
> problem since pandas also uses matplotlib, and for the second one, I could
> look for a Python library alternative to ghostscript.
>
> Looking forward to hearing from you all!
>
> Thanks for your time!
>
> Vinayak

From hritikxx8 at gmail.com Sat Sep 29 13:33:20 2018
From: hritikxx8 at gmail.com (Hritik Vijay)
Date: Sat, 29 Sep 2018 23:03:20 +0530
Subject: [Pandas-dev] Pandas loc function including upper limit
Message-ID:

The `loc` function includes the upper limit, which is very counterintuitive.
Shouldn't it follow `iloc` and other indexing methods and exclude the upper
limit (at least for integral slices)?

--
Regards
Hritik Vijay
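
The slicing behaviour raised in the `loc` message can be reproduced with a
minimal sketch like the one below. The Series contents and the variable
names are made up for illustration; only `Series.loc` and `Series.iloc`
come from pandas. As far as the pandas indexing documentation describes it,
`loc` slices by label and includes both endpoints, while `iloc` slices by
position and follows Python's half-open convention, so the two differ even
when the labels happen to be integers.

```python
import pandas as pd

# Illustrative data: integer labels that happen to match the positions.
s = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])

# Label-based slice: both endpoint labels (1 and 3) are included.
by_label = s.loc[1:3]

# Position-based slice: the stop position (3) is excluded, as in a list.
by_position = s.iloc[1:3]

# With non-integer labels there is no "next label" to stop before,
# which is one commonly given motivation for the inclusive endpoint.
t = pd.Series([10, 20, 30, 40], index=list("abcd"))
label_slice = t.loc["b":"d"]

print(by_label.tolist())     # [20, 30, 40]
print(by_position.tolist())  # [20, 30]
print(label_slice.tolist())  # [20, 30, 40]
```

If the half-open behaviour is what is wanted for an integer-labelled object,
using `iloc` (or slicing up to the previous label) is probably the simplest
route.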