From jorisvandenbossche at gmail.com Fri Nov 2 06:08:51 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 2 Nov 2018 11:08:51 +0100
Subject: [Pandas-dev] Python library to extract tables from PDF files
In-Reply-To:
References:
Message-ID:

Hi Vinayak,

Thanks for mentioning this package on the list! Camelot looks like a really
useful package to me (I needed to extract some data out of a PDF last week,
so I was able to give it a try, and it did exactly what I wanted :)).

Regarding your question about adding it to pandas itself: personally I am a
bit hesitant to further broaden the scope of what is already in pandas,
although in this case it would mainly be calling the external package (which
does have quite a few dependencies). I think it is also nice to have a good
ecosystem of packages that provide additional IO functionality, but then we
should do a better job of advertising them.

In any case, I think it would already be a good first step to list the
package on the ecosystem page in the docs:
http://pandas.pydata.org/pandas-docs/stable/ecosystem.html (regardless of
the above discussion). And we could maybe also have a section on additional
formats on the IO page. PR very welcome for that!

Best,
Joris

On Fri, 28 Sep 2018 at 20:02, Vinayak Mehta wrote:

> I've created a Jupyter notebook which shows an example of how Camelot
> makes it easy to extract tables out of PDFs. In the example, I scrape a PDF
> from this disease outbreaks data source[1] using requests, extract tables
> from each page of the PDF and then concat those tables. Here's the gist!
> https://gist.github.com/vinayak-mehta/e5949f7c2410a0e12f25d3682dc9e873 :)
>
> [1] http://idsp.nic.in/index4.php?lang=1&level=0&linkid=406&lid=3689
>
> On Thu, Sep 27, 2018 at 4:19 PM Vinayak Mehta wrote:
>
>> Hello everyone!
>>
>> I'm a software engineer based out of New Delhi, India. I've been a long
>> time user and have used it in countless projects and scripts!
>> Thanks to the core developers and contributors for working on it!
>>
>> I recently released a Python library which lets users extract data tables
>> out of PDF files, my first open-source library! Here's the link:
>> https://github.com/socialcopsdev/camelot
>>
>> It has a similar API to the pandas read_* functions, bearing most
>> similarity to read_html(). Like read_html(), it has a read_pdf() main
>> interface which returns a list of pandas DataFrames for each table found in
>> the PDF file, and contains two flavors for parsing different types of
>> tables!
>>
>> I've created a comparison with other open-source PDF table extraction
>> libraries and tools in the wiki here.
>>
>> I would be really grateful if you could check it out and see if it's
>> useful to you, and give me any feedback that may help me improve it. I
>> promise it would take less than 5 minutes of your time! :)
>>
>> To the core devs: I was wondering if pandas would be open to accepting this
>> library as a contribution to its read_* interface? The library uses
>> OpenCV's morphological transformations to detect lines in PDFs when
>> flavor='lattice', which I could vendorize or re-implement. It also has two
>> system-specific dependencies, which are python-tk (used by matplotlib) and
>> ghostscript (used to convert PDF to PNG). The first one shouldn't pose a
>> problem since pandas also uses matplotlib, and for the second one, I could
>> look for a Python library alternative to ghostscript.
>>
>> Looking forward to hearing from you all!
>>
>> Thanks for your time!
>>
>> Vinayak
>>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jorisvandenbossche at gmail.com Wed Nov 7 19:57:50 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 8 Nov 2018 01:57:50 +0100
Subject: [Pandas-dev] Pandas development hangout - Thursday November 8 at 18:30 UTC
In-Reply-To:
References:
Message-ID:

Hi all,

We're having a dev chat tomorrow, Thursday (November 8), at 18:30 UTC (I
*think* this is 10:30 Pacific / 13:30 Eastern / 18:00 UTC / 19:30 CET
(Europe)). All are welcome to attend.

We haven't updated the agenda yet, but we will probably discuss progress
towards 0.24.0 and potential blockers for that.

Hangout: https://meet.google.com/tii-twqi-sco

Calendar invite:
https://calendar.google.com/event?action=TEMPLATE&tmeid=N3ZlMjlxaGU4NGh1bjJwMGdkamN2dDZqaGwgam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com

Agenda/Minutes:
https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing

Joris

From me at pietrobattiston.it Thu Nov 8 02:50:08 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 08 Nov 2018 08:50:08 +0100
Subject: [Pandas-dev] Pandas development hangout - Thursday November 8 at 18:30 UTC
In-Reply-To:
References:
Message-ID: <1541663408.18453.288.camel@pietrobattiston.it>

On Thu, 8 Nov 2018 at 01:57 +0100, Joris Van den Bossche wrote:
> Hi all,
>
> We're having a dev chat tomorrow Thursday (November 8) at 18:30 UTC
> (I think this is [...] / 18:00 UTC

Shall I file a bug with label "Timeseries"?

(Or just hope nobody lives in the UTC timezone? :-D )

Pietro

From tom.augspurger88 at gmail.com Thu Nov 8 07:08:39 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 8 Nov 2018 06:08:39 -0600
Subject: [Pandas-dev] Pandas development hangout - Thursday November 8 at 18:30 UTC
In-Reply-To:
References:
Message-ID:

If you need to call in to the meeting, I think this number will work.
+1 720-443-5822 PIN: 186953630

Tom

On Wed, Nov 7, 2018 at 6:58 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Hi all,
>
> We're having a dev chat tomorrow Thursday (November 8) at 18:30 UTC (I
> *think* this is 10:30 Pacific / 13:30 Eastern / 18:00 UTC / 19:30 CET
> (Europe)).
> All are welcome to attend.
>
> We didn't update the agenda yet, but we will probably discuss progress
> towards 0.24.0 and potential blockers for that.
>
> Hangout: https://meet.google.com/tii-twqi-sco
>
> Calendar invite:
> https://calendar.google.com/event?action=TEMPLATE&tmeid=N3ZlMjlxaGU4NGh1bjJwMGdkamN2dDZqaGwgam9yaXN2YW5kZW5ib3NzY2hlQG0&tmsrc=jorisvandenbossche%40gmail.com
>
> Agenda/Minutes:
> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing
>
> Joris

From jorisvandenbossche at gmail.com Fri Nov 16 15:07:58 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 16 Nov 2018 21:07:58 +0100
Subject: [Pandas-dev] Sparse data structures in pandas: refactor - feedback welcome!
Message-ID:

Hi all,

Sharing the message that Tom posted on Twitter (
https://twitter.com/TomAugspurger/status/1062718319445213184) on the mailing
lists as well: we are making changes to the support for sparse data in
pandas, and would like to get feedback on this.

To give some context: parts of the pandas internals are being refactored on
top of ExtensionArrays. This also applies to the sparse data structures:

- The SparseArray is refactored to follow the ExtensionArray protocol, and
  this has some consequences (also impacting SparseSeries and Series holding
  sparse data): no longer subclassing numpy.ndarray, change in `np.asarray`
  behaviour, ...
  For more details see
  http://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#sparse-data-structure-refactor
- Since a normal pandas Series and DataFrame can hold sparse data, there may
  be no need for the dedicated SparseSeries and SparseDataFrame subclasses.
  Therefore, we are planning to deprecate those subclasses, and the specific
  sparse functionality will be accessible on normal Series/DataFrame with
  the `sparse` accessor.
  However, this might have complications we didn't think about, so we need
  your feedback! See https://github.com/pandas-dev/pandas/issues/19239 and
  https://github.com/pandas-dev/pandas/issues/21978 for related GitHub
  issues on this topic.

If you are a user of the sparse functionality in pandas, trying out master
and providing feedback would be much appreciated.

Best,
Joris

(I am sending this to both the pydata and pandas-dev mailing lists, but
please reply to pandas-dev at python.org)

From me at pietrobattiston.it Fri Nov 16 17:48:27 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Fri, 16 Nov 2018 23:48:27 +0100
Subject: [Pandas-dev] [pydata] Sparse data structures in pandas: refactor - feedback welcome!
In-Reply-To:
References:
Message-ID: <1542408507.2476.14.camel@pietrobattiston.it>

Hi Joris, thanks for the recap...

On Fri, 16 Nov 2018 at 21:07 +0100, Joris Van den Bossche wrote:
> [...]
> - Since a normal pandas Series and DataFrame can hold sparse data,
> there may be no need for the dedicated SparseSeries and
> SparseDataFrame subclasses. Therefore, we are planning to deprecate
> those subclasses, and the specific sparse functionality will be
> accessible on normal Series/DataFrame with the `sparse` accessor.
> However, this might have complications we didn't think about, so we
> need your feedback!
From the last dev discussion I thought we had decided to not provide (at
least immediately) any actual replacement for the SparseDataFrame class (in
the sense of supporting 2d sparse structures, i.e. skipping columns).
Unless I misunderstood, this is probably the change that users (of sparse
structures) should be most aware of.

(On the other hand, it is true that in most cases transposing the DataFrame
will probably still allow for exactly the same data structure ...)

Pietro

From tom.augspurger88 at gmail.com Fri Nov 16 21:34:57 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Sat, 17 Nov 2018 02:34:57 +0000
Subject: [Pandas-dev] [pydata] Sparse data structures in pandas: refactor - feedback welcome!
In-Reply-To: <1542408507.2476.14.camel@pietrobattiston.it>
References: , <1542408507.2476.14.camel@pietrobattiston.it>
Message-ID:

Just to be clear, the current sparse dataframe stores each column
independently. There's no memory saving over a DataFrame of sparse columns.

________________________________
From: pydata at googlegroups.com on behalf of Pietro Battiston
Sent: Friday, November 16, 2018 16:48
To: pydata at googlegroups.com; pandas-dev at python.org
Subject: Re: [pydata] Sparse data structures in pandas: refactor - feedback welcome!

Hi Joris, thanks for the recap...

On Fri, 16 Nov 2018 at 21:07 +0100, Joris Van den Bossche wrote:
> [...]
> - Since a normal pandas Series and DataFrame can hold sparse data,
> there may be no need for the dedicated SparseSeries and
> SparseDataFrame subclasses. Therefore, we are planning to deprecate
> those subclasses, and the specific sparse functionality will be
> accessible on normal Series/DataFrame with the `sparse` accessor.
> However, this might have complications we didn't think about, so we
> need your feedback!
From the last dev discussion I thought we had decided to not provide (at
least immediately) any actual replacement for the SparseDataFrame class (in
the sense of supporting 2d sparse structures, i.e. skipping columns).
Unless I misunderstood, this is probably the change that users (of sparse
structures) should be most aware of.

(On the other hand, it is true that in most cases transposing the DataFrame
will probably still allow for exactly the same data structure ...)

Pietro

From me at pietrobattiston.it Sat Nov 17 18:10:48 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Sun, 18 Nov 2018 00:10:48 +0100
Subject: [Pandas-dev] [pydata] Sparse data structures in pandas: refactor - feedback welcome!
In-Reply-To:
References: ,<1542408507.2476.14.camel@pietrobattiston.it>
Message-ID: <1542496248.2476.25.camel@pietrobattiston.it>

On Sat, 17 Nov 2018 at 02:34 +0000, Tom Augspurger wrote:
> Just to be clear, the current sparse dataframe stores each column
> independently. There's no memory saving over a DataFrame of sparse
> columns.
>

Oh, I had missed that (and the fact that sparse columns are never
consolidated!). Thanks,

Pietro

From jorisvandenbossche at gmail.com Sun Nov 18 03:58:59 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sun, 18 Nov 2018 09:58:59 +0100
Subject: [Pandas-dev] [pydata] Sparse data structures in pandas: refactor - feedback welcome!
In-Reply-To: <1542496248.2476.25.camel@pietrobattiston.it>
References: <1542408507.2476.14.camel@pietrobattiston.it> <1542496248.2476.25.camel@pietrobattiston.it>
Message-ID:

On Sun, 18 Nov 2018 at 00:10, Pietro Battiston wrote:

> On Sat, 17 Nov 2018 at 02:34 +0000, Tom Augspurger wrote:
> > Just to be clear, the current sparse dataframe stores each column
> > independently. There's no memory saving over a DataFrame of sparse
> > columns.
> >
>
> Oh, I had missed that (and the fact that sparse columns are never
> consolidated!). Thanks,
>

Yes, so storage-wise, a DataFrame with sparse columns and a SparseDataFrame
are identical under the hood. The main difference is that SparseDataFrame
adds some extra functionality (sparse-specific methods, which could be
exposed on a normal DataFrame with a sparse accessor) and guarantees (each
column will be sparse, a default fill value for the full dataframe, ...).

So the question we mostly seek feedback on is whether a normal DataFrame
with sparse columns would suffice in practice in most cases, or which
aspects of the current SparseDataFrame would be blockers for switching to a
normal DataFrame with sparse columns.

Joris

From nathan1465 at gmail.com Wed Nov 28 20:54:16 2018
From: nathan1465 at gmail.com (Nathan Heafner)
Date: Wed, 28 Nov 2018 20:54:16 -0500
Subject: [Pandas-dev] pandas 2 still active?
Message-ID:

I just wanted to know if Pandas 2 is still being developed? The repo seems
like it's abandoned.

~Nathan

From wesmckinn at gmail.com Wed Nov 28 20:59:23 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 28 Nov 2018 19:59:23 -0600
Subject: [Pandas-dev] pandas 2 still active?
In-Reply-To:
References:
Message-ID:

hi Nathan,

All relevant work along these lines the last couple of years has been
happening in the Apache Arrow project (https://github.com/apache/arrow).
What we had been calling "pandas 2" will most likely appear with a new
project name (and be a companion project to pandas-dev/pandas) at some
point in the future, but it's going to take some time. Projects like these,
particularly given the sparse funding situation, take many years to build.

- Wes

On Wed, Nov 28, 2018 at 7:56 PM Nathan Heafner wrote:
>
> I just wanted to know if Pandas 2 is still being developed? The repo seems like it's abandoned.
>
>
> ~Nathan