[omaha] PDF to Text parsing

Jeff Hinrichs - DM&T jeffh at dundeemt.com
Thu Oct 4 16:06:16 CEST 2012


Let us know if we can help.  Looks like there is plenty of experience here
on the list.

Best,

Jeff

On Thu, Oct 4, 2012 at 12:41 AM, Rob Townley <rob.townley at gmail.com> wrote:

> Thanks Jeff ... starting a new project after already doing other
> things for 18 hours ... the link and code help.
>
> ReEnergizeProgram.org would be a 3rd party and does not have
> permission to access MUD/OPPD bills directly.  They could use a portal
> where individuals upload their bills, then parse the PDFs on the
> website and load the data into a db.   It would be great for the customer
> to get it as CSV/XML/JSON, but that is a great deal of data to warehouse.
> Ideally MUD would use RF energy harvesting to keep their batteries
> charged and provide homeowners with real time utility usage.
>
> I still have to sign up for accounts on MUD / OPPD to get PDFs.  I
> could probably share the awful amount of water used to regrow my lawn.
>
> I do appreciate everyone's help ...
>
> On Tue, Oct 2, 2012 at 11:18 AM, Jeff Hinrichs <jeffh at delasco.com> wrote:
> > Depending on how they generate their pdfs, this StackOverflow article
> > has useful info:
> > http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text
> > I do quite a bit of pdf handling here at work and I use pyPdf.
> > The article points out the useful bits for text extraction.
> >
> > import pyPdf
> > # filename is the path to the PDF; the file must stay open while reading
> > with open(filename, "rb") as f:
> >     pdf = pyPdf.PdfFileReader(f)
> >     for page in pdf.pages:
> >         print page.extractText()
> >
> >
> > I only use it to do "sanity" checks when I'm breaking out individual
> > invoices from batch runs.   Of course, the better way to do this is to
> > interact with their back end data directly.   Getting the text is just
> > the first step, then you need to parse it and locate the useful bits.
> > All of that should be available from their accounting db.  However,
> > their situation may not allow access, but if they don't have access yet,
> > the first thing I would do is build a mechanism for querying (read only)
> > their primary data.
> >
> > Our internally developed (python/django) document management system ties
> > generated documents to accounting and shipping information so any
> > csr/manager can get the information.  It's a REST based interface, so
> > they just refer to a URL within email messages internally.  They can fax
> > or email information/copies to customers from within the system interface.
> >
> >
> >
> > -Jeff
> >
> >
> >
> > On Tue, Oct 2, 2012 at 11:03 AM, Rob Townley <rob.townley at gmail.com> wrote:
> >
> >> The ReEnergizeProgram.org auditor said that a big slowdown is getting
> >> all the data from PDF based bills from MUD and OPPD into a spreadsheet
> >> / database.  Sounds like they email stuff, copy-n-paste a lot, and then
> >> email on.
> >>
> >> What perl/python/php modules would you recommend for parsing the text
> >> from PDF?
> >> _______________________________________________
> >> Omaha Python Users Group mailing list
> >> Omaha at python.org
> >> http://mail.python.org/mailman/listinfo/omaha
> >> http://www.OmahaPython.org
> >>
>
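
For the "parse it and locate the useful bits" step in the quoted message, a
minimal sketch in Python 3 (standard library only) might look like the
following. The field labels ("Account Number", "Usage", "Amount Due") and the
names parse_bill_text and records_to_csv are hypothetical; the actual MUD/OPPD
bill layouts are not shown in this thread, so real bills would need their own
patterns.

```python
import csv
import io
import re

# Hypothetical field patterns: real MUD/OPPD bill layouts are not shown in
# this thread, so these labels are assumptions for illustration only.
FIELD_PATTERNS = {
    "account": re.compile(r"Account Number:\s*(\d+)"),
    "usage_kwh": re.compile(r"Usage:\s*([\d,]+)\s*kWh"),
    "amount_due": re.compile(r"Amount Due:\s*\$([\d,]+\.\d{2})"),
}


def parse_bill_text(text):
    """Return a dict of the fields found in one bill's extracted text."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # Missing fields become None so a bad extraction is easy to spot.
        record[name] = match.group(1).replace(",", "") if match else None
    return record


def records_to_csv(records):
    """Serialize parsed records to a CSV string with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(FIELD_PATTERNS), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Made-up sample standing in for the text returned by PDF extraction.
sample = "Account Number: 12345\nUsage: 1,234 kWh\nAmount Due: $98.76\n"
print(records_to_csv([parse_bill_text(sample)]))
```

Keeping the patterns in one table means a layout change on the bill only
touches FIELD_PATTERNS, and a field that fails to match shows up as an empty
CSV cell rather than raising an error.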



-- 
Best,

Jeff Hinrichs
402.218.1473

