[Tutor] PDF Scraping

Francois Dion francois.dion at gmail.com
Wed Nov 25 12:43:51 EST 2015


This is well beyond the scope of Tutor, but let me mention the following:

The code for pdftables disappeared from GitHub some time back. What is on
SourceForge is old, and the same goes for PyPI. I wouldn't start a project
with pdftables based on that...

As for what you are trying to do, it looks like they might have the data
in Excel spreadsheets. Those are trivial to load in pandas. If you have any
choice at all, avoid PDF as a data source at all costs. For some detail on
the complexity, see:
http://ieg.ifs.tuwien.ac.at/pub/yildiz_iicai_2005.pdf
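
To give an idea of how little code the spreadsheet route needs, here is a
minimal sketch. The file name and the helper names (load_sheet, drop_empty)
are made up for illustration, and I am assuming the table sits on the first
worksheet with its column names in the first row; adjust both for the real
file. Recent pandas spells the argument sheet_name (older releases used
sheetname), and reading needs an Excel engine installed (openpyxl for .xlsx,
xlrd for .xls).

```python
import pandas as pd


def load_sheet(path, sheet_name=0, header_row=0):
    """Read one worksheet into a DataFrame.

    header_row is the zero-based row that holds the column names;
    any rows above it (titles, notes) are skipped by pandas.
    """
    return pd.read_excel(path, sheet_name=sheet_name, header=header_row)


def drop_empty(df):
    """Drop rows and columns that are entirely empty (typical spreadsheet padding)."""
    return df.dropna(how="all").dropna(axis=1, how="all")


# Usage, with a hypothetical file name (substitute the file you downloaded):
#   table = drop_empty(load_sheet("gold_table.xlsx"))
#   print(table.head())
```

If the sheet has title lines above the real header, raising header_row is
usually enough; anything more irregular is still far easier to clean up in a
DataFrame than to dig out of a PDF.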

For your two documents, if you cannot find the data in the Excel sheets, I
think the tabula (Ruby-based application) approach is the best bet.

Francois

On Wed, Nov 25, 2015 at 8:41 AM, Python Beginner <
pythonbeginner004 at gmail.com> wrote:

> Oh, I forgot to mention that I am using Python 3.4. Thanks again for your
> help pointing me in the right direction.
>
> ~Chris
>
> On Tue, Nov 24, 2015 at 1:36 PM, Python Beginner <
> pythonbeginner004 at gmail.com> wrote:
>
> > Hi,
> >
> > I am looking for the best way to scrape the following PDFs:
> >
> > (1)
> > http://minerals.usgs.gov/minerals/pubs/commodity/gold/mcs-2015-gold.pdf
> > (table on page 1)
> >
> > (2)
> > http://minerals.usgs.gov/minerals/pubs/commodity/gold/myb1-2013-gold.pdf
> > (table 1)
> >
> > I have done a lot of research and have read that pdftables 0.0.4 is an
> > excellent way to scrape tabular data from PDFs (see
> > https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/
> > ).
> >
> > I downloaded pdftables 0.0.4 (see https://pypi.python.org/pypi/pdftables).
> >
> > I am new to Python and am having trouble finding good documentation on how
> > to use this library.
> >
> > Has anybody used pdftables before who could help me get started or point
> > me to the ideal library for scraping the PDF links above? I have read that
> > different PDF libraries are used depending on the format of the PDF. What
> > library would be best for the PDF formats above? Knowing this will help me
> > get started, then I can write up some code and ask further questions if
> > needed.
> >
> > Thanks in advance for your help!
> >
> > ~Chris
> >



-- 
raspberry-python.blogspot.com - www.pyptug.org - www.3DFutureTech.info -
@f_dion

