[omaha] Omaha Digest, Vol 77, Issue 8

Mike Hostetler mike at squarepegsystems.com
Wed Jul 17 16:00:43 CEST 2013


PDF conversion to anything is hard. Turning a PDF into anything structured,
like HTML or any other markup language, is fraught with peril. If you are
lucky, you can dump the PDFs to straight, plain text. If you can do that,
leave it alone.

Matt Steele came to me for advice on how to parse the PDFs (I had tweeted
that I was doing it at work, so he contacted me off-line), and what they used
matches fairly closely with what I do. The kicker is that sometimes people
simply scan a page and save that image in a PDF. It looks like text, but
it's actually an image, so you can't use normal PDF parsing tools on it.
The best you can do is extract each image and run an OCR tool on it
(which is what tesseract is).
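
For anyone who wants to try this in Python, here is a rough sketch of that
flow. It is not the code Matt's team or I actually run. It assumes the
poppler command-line tools (pdftotext, pdfimages) and a tesseract build that
accepts "stdout" as its output base are on your PATH. The 200-character
threshold, the function names, and the "ocr_tmp" directory are made up for
illustration.

```python
import subprocess
from pathlib import Path

# Hypothetical threshold: fewer extracted characters than this and we
# assume the PDF is really a stack of scanned images.
MIN_TEXT_CHARS = 200


def looks_scanned(text, min_chars=MIN_TEXT_CHARS):
    """Guess whether a PDF had no real text layer, judging by its text dump."""
    return len(text.strip()) < min_chars


def pdf_to_text(pdf_path):
    """Dump a PDF to plain text with poppler's pdftotext ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_pdf(pdf_path, work_dir):
    """Extract every embedded image with pdfimages, then OCR each one."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / Path(pdf_path).stem
    # Without a format flag, pdfimages writes PBM/PPM files named prefix-NNN.*
    subprocess.run(["pdfimages", str(pdf_path), str(prefix)], check=True)
    pages = []
    for image in sorted(work_dir.glob(prefix.name + "-*")):
        # "stdout" as the output base makes tesseract print the text it found.
        ocr = subprocess.run(
            ["tesseract", str(image), "stdout"],
            check=True, capture_output=True, text=True,
        )
        pages.append(ocr.stdout)
    return "\n".join(pages)


def extract_text(pdf_path, work_dir="ocr_tmp"):
    """Use the text layer when there is one, and fall back to OCR otherwise."""
    text = pdf_to_text(pdf_path)
    return ocr_pdf(pdf_path, work_dir) if looks_scanned(text) else text
```

Calling extract_text("agenda.pdf") on a hypothetical file returns the plain
text either way. The character-count check is a crude guess at "this is a
scan", so expect to tune it against real agendas.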

Throwing the result into ElasticSearch is certainly the best option -- it's
magical how well it works. Just pump the text files into an ES server
via its REST interface and, voila, you can google your documents.

http://www.elasticsearch.org/overview/
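
As a hedged illustration of "pump the text files in", the sketch below uses
only the Python standard library. The server address, the "agendas" index
name, and the "name"/"text" field names are all assumptions. The /_doc/ path
is the one current ES releases document; older servers expect
/index/type/id instead, so adjust for whatever version you run.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

ES_URL = "http://localhost:9200"   # assumption: a local, unsecured ES node
INDEX = "agendas"                  # hypothetical index name


def build_index_request(doc_id, text, es_url=ES_URL, index=INDEX):
    """Build (but do not send) a PUT that stores one document's text."""
    safe_id = urllib.parse.quote(doc_id, safe="")
    url = "{}/{}/_doc/{}".format(es_url, index, safe_id)
    body = json.dumps({"name": doc_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )


def index_text_files(directory):
    """PUT every .txt file in a directory into the index, one doc per file."""
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        request = build_index_request(path.stem, text)
        with urllib.request.urlopen(request) as response:
            response.read()  # ES answers with a small JSON status document


def build_search_request(words, es_url=ES_URL, index=INDEX):
    """Build a GET for a simple query-string search over the "text" field."""
    query = urllib.parse.urlencode({"q": "text:({})".format(words)})
    url = "{}/{}/_search?{}".format(es_url, index, query)
    return urllib.request.Request(url)
```

With a server running, index_text_files("agenda_text") loads a hypothetical
directory of dumps, and passing build_search_request("liquor license") to
urllib.request.urlopen returns the matching documents as JSON.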




On Wed, Jul 17, 2013 at 1:07 AM, Wes Turner <wes.turner at gmail.com> wrote:

> So the data/publishing process would be something like:
>
>     Template -> Text Processor -> PDF -> SQL/search -> Template -> HTML
>
> ?
>
> I definitely see the value in getting past agenda PDFs indexed and
> searchable, but wouldn't it make more sense to just save as HTML? (Or .RST
> :o)
>
>     Template -> Text Processor -> HTML (-> ___ custom search)
>
> That way, we could read these documents on a device with a smaller screen.
> And permalink to headers. And adjust font sizes. And search.
>
> .. see also: http://redd.it/140oxj
> On Jul 16, 2013 9:15 AM, "Matt Wynn" <matt.wynn at gmail.com> wrote:
>
> > Jay, you're my people!
> >
> > And Jim, I think Jeff and I had talked about council agendas as a fun
> > project to toy with. I've heard from folks at the city, for instance,
> > that they'd love to have just the underlying functionality available.
> > That way they could have a centralized source for finding or displaying
> > zoning/planning/liquor/whatever issues. To separate the business logic of
> > that project from the site itself would have value -- I'd host it in a
> > heartbeat.
> >
> >
> >
> >
> > On Tue, Jul 16, 2013 at 5:00 AM, <omaha-request at python.org> wrote:
> >
> > > Send Omaha mailing list submissions to
> > >         omaha at python.org
> > >
> > > To subscribe or unsubscribe via the World Wide Web, visit
> > >         http://mail.python.org/mailman/listinfo/omaha
> > > or, via email, send a message with subject or body 'help' to
> > >         omaha-request at python.org
> > >
> > > You can reach the person managing the list at
> > >         omaha-owner at python.org
> > >
> > > When replying, please edit your Subject line so it is more specific
> > > than "Re: Contents of Omaha digest..."
> > >
> > >
> > > Today's Topics:
> > >
> > >    1. Open source projects/Hack Omaha (Matt Wynn)
> > >    2. Re: Open source projects/Hack Omaha (James Drake)
> > >    3. Re: Open source projects/Hack Omaha (Jay Hannah)
> > >
> > >
> > > ----------------------------------------------------------------------
> > >
> > > Message: 1
> > > Date: Mon, 15 Jul 2013 15:46:53 -0500
> > > From: Matt Wynn <matt.wynn at gmail.com>
> > > To: omaha at python.org
> > > Subject: [omaha] Open source projects/Hack Omaha
> > > Message-ID:
> > >         <CAFXqcUrnfbEkTQWmgTxQv5b3eyJqF5=
> > > iXJqcXtu6vBeiwpF7Bg at mail.gmail.com>
> > > Content-Type: text/plain; charset=ISO-8859-1
> > >
> > > Jeff Hinrichs and I had lunch last week and wanted to have this
> > > discussion on list. So pardon the royal Wes and other whatnot.
> > >
> > > The basic gist of our conversation was finding ways Omaha Python and
> > > Hack Omaha could play together. For those that don't know, Hack Omaha
> > > is an annual event/competition, sponsored by the World-Herald, to make
> > > cool civic projects. Generally, the projects start with civic
> > > information, and aim to make that information more useful, more fun, or
> > > just more accessible. (We're tentatively scheduled for Oct 18-20 this
> > > year, if it sounds intriguing).
> > >
> > > If I recall correctly, there were two ways we thought the groups could
> > > fit together:
> > >
> > > - First, of course, folks from the group could compete in Hack Omaha. I
> > > hope you do, but I understand giving up a weekend to code isn't always
> > > a great attraction. (Though this one will be better than the previous
> > > two... and I thought those were pretty fun).
> > > - Second, members could adopt projects after the competition, and
> > > either move them forward or convert them to Python. Case in point, the
> > > winner from the last event:
> > > https://github.com/mattdsteele/hackomaha-council-agendas .
> > > It pulls down nasty city council agenda PDFs from the city's website,
> > > then indexes them and makes them text-searchable. It's awesome. It's
> > > also in Perl.
> > >
> > > If a project is Pythonized, The World-Herald can host it on
> > > dataomaha.com, where we'd be super happy to credit anyone who had their
> > > hands in the app itself. So now we're working with the trifecta:
> > > - Fun open source projects that build community
> > > - And make local government data more useful
> > > - And make killer entries on a resume
> > >
> > > Am I missing anything, Jeff? And non-Jeffs, all of this might be
> > > bullshit. We also might be missing something key.
> > >
> > > Does any of this sound good? Is there anything you think we're missing?
> > > Tell us what sucks!
> > >
> > > -Matt
> > >
> > >
> > > ------------------------------
> > >
> > > Message: 2
> > > Date: Mon, 15 Jul 2013 16:23:03 -0500
> > > From: James Drake <jim.drake at gmail.com>
> > > To: Omaha Python Users Group <omaha at python.org>
> > > Subject: Re: [omaha] Open source projects/Hack Omaha
> > > Message-ID:
> > >         <
> > > CAK-qddObXb3gSOjULuxF_XLAkY37UMe2e6sGRhwZ+kLGeaKW+g at mail.gmail.com>
> > > Content-Type: text/plain; charset=UTF-8
> > >
> > > Matt,
> > >
> > > I'd be interested in putting my meager talents to work on a project,
> > > maybe sooner rather than later. Are there any left over that weren't
> > > attempted? The timing for volunteering may just work and learning
> > > something new may just work better for me. I'm more comfortable with
> > > Perl now but would be happy to work at a slower pace in Python too.
> > >
> > > Jim
> > >
> > >
> > >
> > >
> > > ------------------------------
> > >
> > > Message: 3
> > > Date: Mon, 15 Jul 2013 16:27:30 -0500
> > > From: Jay Hannah <jay at jays.net>
> > > To: Omaha Python Users Group <omaha at python.org>
> > > Subject: Re: [omaha] Open source projects/Hack Omaha
> > > Message-ID: <7FB438AD-EEDD-44AA-A42C-F37001DAE37E at jays.net>
> > > Content-Type: text/plain; charset=us-ascii
> > >
> > > On Jul 15, 2013, at 3:46 PM, Matt Wynn <matt.wynn at gmail.com> wrote:
> > > > giving up a weekend to code isn't always a great attraction.
> > >
> > > It's not? I do it a lot.   ;)
> > >
> > > > - Second, members could adopt projects after the competition, and
> > > > either move them forward or convert them to Python. Case in point,
> > > > the winner from the last event:
> > > > https://github.com/mattdsteele/hackomaha-council-agendas .
> > > > It pulls down nasty city council agenda PDFs from the city's website,
> > > > then indexes them and makes them text-searchable. It's awesome. It's
> > > > also in Perl.
> > > >
> > > > If a project is Pythonized, The World-Herald can host it on
> > > > dataomaha.com,
> > >
> > > Why can't / doesn't dataomaha.com "host" stuff that's already done that
> > > happens to be written in Perl? Just curious.
> > >
> > > Thanks,
> > >
> > > j
> > > omacode.org
> > >
> > >
> > >
> > >
> > >
> > >
> > > ------------------------------
> > >
> > > Subject: Digest Footer
> > >
> > > _______________________________________________
> > > Omaha mailing list
> > > Omaha at python.org
> > > http://mail.python.org/mailman/listinfo/omaha
> > >
> > >
> > > ------------------------------
> > >
> > > End of Omaha Digest, Vol 77, Issue 8
> > > ************************************
> > >
> _______________________________________________
> Omaha Python Users Group mailing list
> Omaha at python.org
> http://mail.python.org/mailman/listinfo/omaha
> http://www.OmahaPython.org
>



-- 
Mike Hostetler
SquarePeg Systems
http://www.squarepegsystems.com

