[omaha] Omaha Digest, Vol 77, Issue 8

Wes Turner wes.turner at gmail.com
Wed Jul 17 18:44:31 CEST 2013


I remember reading about what the major search engines use for PDF-to-HTML
and OCR for bitmap PDFs but can't find the link. Many of the desktop search
programs do have some sort of PDF support.

On Jul 17, 2013 9:00 AM, "Mike Hostetler" <mike at squarepegsystems.com> wrote:
>
> PDF conversion to anything is hard. Turning it into anything structured
> like HTML or any markup language would be fraught with peril. If you are
> lucky, you can dump them to straight, plain text. If you can do that,
> leave it alone.
>
> Matt Steele came to me for advice on how to parse the PDFs (I had tweeted
> that I was doing it at work so he contacted me off-line) and what they
> used matches fairly closely with what I do. The kicker is that sometimes
> people simply scan a page and save that image in a PDF. It looks like
> text, but it's actually an image. So you can't use normal PDF parsing
> tools with that. The best you can do is extract each image and run an OCR
> tool on them (which is what tesseract is).
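
A rough sketch of that "plain text first, OCR as a fallback" flow. It
assumes the poppler command-line tools (pdftotext, pdfimages) and tesseract
are installed and on PATH; the function names and the 40-character
threshold are made up for illustration and are not taken from the Hack
Omaha project.

```python
import subprocess
from pathlib import Path


def looks_scanned(text: str, min_chars: int = 40) -> bool:
    """Heuristic (an assumption): almost no extractable text means a scan."""
    return len("".join(text.split())) < min_chars


def extract_text(pdf_path: str, workdir: str) -> str:
    """Try plain-text extraction first; fall back to OCR on embedded images."""
    # "-" as the output file makes pdftotext write to stdout.
    text = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    ).stdout
    if not looks_scanned(text):
        return text

    # Scanned PDF: dump each embedded image as PNG, then OCR them in order.
    prefix = str(Path(workdir) / "page")
    subprocess.run(["pdfimages", "-png", pdf_path, prefix], check=True)
    pages = []
    for image in sorted(Path(workdir).glob("page-*.png")):
        # "stdout" as the output base makes tesseract print the recognized text.
        ocr = subprocess.run(
            ["tesseract", str(image), "stdout"],
            check=True, capture_output=True, text=True,
        )
        pages.append(ocr.stdout)
    return "\n".join(pages)
```

Calling extract_text("agenda.pdf", "/tmp/agenda") on a text-based PDF never
touches tesseract; only PDFs that come back nearly empty take the slow OCR
path.
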
>
> Throwing the result into ElasticSearch is certainly the best option --
> it's magical how well it works. Just pump the text files into an ES server
> via its REST interface and, voilà, you can google your documents.
>
> http://www.elasticsearch.org/overview/
>
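
For reference, pumping one text file into ES over REST could look roughly
like this with only the standard library. The index name "agendas", the
"text" field, and the localhost URL are assumptions, and the `_doc`
endpoint is the form used by recent Elasticsearch releases (older servers
expect a type name in that position).

```python
import json
import urllib.request
from urllib.parse import quote


def build_index_request(base_url: str, index: str, doc_id: str,
                        text: str) -> urllib.request.Request:
    """Build a PUT request that stores one document under a known id."""
    url = f"{base_url.rstrip('/')}/{quote(index)}/_doc/{quote(doc_id, safe='')}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )


def index_text(base_url: str, index: str, doc_id: str, text: str) -> dict:
    """Send the request and return the server's JSON reply."""
    request = build_index_request(base_url, index, doc_id, text)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Something like index_text("http://localhost:9200", "agendas", "2013-07-16",
open("agenda.txt").read()) per file would then be the whole loader.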

Faceted query support like that provided by ElasticSearch (Lucene) would be
helpful.
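
A sketch of what a faceted search body might look like. Later
Elasticsearch versions replaced the facets API with aggregations, so this
uses a terms aggregation; the field names ("text", "category") are
hypothetical and depend entirely on how the documents get indexed.

```python
def faceted_query(terms: str, facet_field: str, size: int = 10) -> dict:
    """Full-text match plus per-value counts for one field."""
    return {
        "query": {"match": {"text": terms}},
        "aggs": {
            "facet": {"terms": {"field": facet_field, "size": size}},
        },
    }
```

POSTing faceted_query("liquor license", "category") as JSON to the index's
_search endpoint should return the hits plus a count per category value.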

I was just looking into a Kibana3 query dashboard for ElasticSearch (for
Heka). I am somewhat familiar with PyES, but Kibana3 is all JavaScript
(like the Deniz SPARQL browser).

Not sure how/whether Kibana3 supports any sort of throttling. (E.g. how
resilient is an un-reverse-proxied ES?)

* https://tika.apache.org/1.4/formats.html#Portable_Document_Format

* https://github.com/elasticsearch/elasticsearch-mapper-attachments/blob/master/README.md

While convenient for archiving, most PDF serializations of editor document
trees end up mixing way too much presentation with the content.

The PDF.js HTML5 PDF viewer is also great.

>
>
>
> On Wed, Jul 17, 2013 at 1:07 AM, Wes Turner <wes.turner at gmail.com> wrote:
>
> > So the data/publishing process would be something like:
> >
> >     Template -> Text Processor -> PDF -> SQL/search -> Template -> HTML
> >
> > ?
> >
> > I definitely see the value in getting past agenda PDFs indexed and
> > searchable, but wouldn't it make more sense to just save as HTML? (Or
> > .RST :o)
> >
> >     Template -> Text Processor -> HTML (-> ___ custom search)
> >
> > That way, we could read these documents on a device with a smaller
> > screen. And permalink to headers. And adjust font sizes. And search.
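
To make the "permalink to headers" part concrete, here is a minimal sketch
of the Template -> HTML step. The (heading, body) input shape and both
function names are assumptions for illustration; a real agenda would need
richer structure, and duplicate headings would need de-duplicated ids.

```python
import html
import re


def slugify(title: str) -> str:
    """Turn a heading into a URL-fragment-safe id."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def render_agenda(title: str, items: list[tuple[str, str]]) -> str:
    """Render (heading, body) pairs as HTML with self-linking headers."""
    parts = [f"<h1>{html.escape(title)}</h1>"]
    for heading, body in items:
        anchor = slugify(heading)
        parts.append(
            f'<h2 id="{anchor}"><a href="#{anchor}">'
            f"{html.escape(heading)}</a></h2>"
        )
        parts.append(f"<p>{html.escape(body)}</p>")
    return "\n".join(parts)
```

Each item then gets a stable URL fragment (agenda.html#zoning, say) that
can be linked to directly.
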
> >
> > .. see also: http://redd.it/140oxj
> > On Jul 16, 2013 9:15 AM, "Matt Wynn" <matt.wynn at gmail.com> wrote:
> >
> > > Jay, you're my people!
> > >
> > > And Jim, I think Jeff and I had talked about council agendas as a fun
> > > project to toy with. I've heard from folks at the city, for instance,
> > > that they'd love to have just the underlying functionality available.
> > > That way they could have a centralized source for finding or
> > > displaying zoning/planning/liquor/whatever issues. To separate the
> > > business logic of that project from the site itself would have value --
> > > I'd host it in a heartbeat.
> > >
> > >
> > >
> > >
> > > On Tue, Jul 16, 2013 at 5:00 AM, <omaha-request at python.org> wrote:
> > >
> > > >
> > > > Today's Topics:
> > > >
> > > >    1. Open source projects/Hack Omaha (Matt Wynn)
> > > >    2. Re: Open source projects/Hack Omaha (James Drake)
> > > >    3. Re: Open source projects/Hack Omaha (Jay Hannah)
> > > >
> > > >
> > > > ----------------------------------------------------------------------
> > > >
> > > > Message: 1
> > > > Date: Mon, 15 Jul 2013 15:46:53 -0500
> > > > From: Matt Wynn <matt.wynn at gmail.com>
> > > > To: omaha at python.org
> > > > Subject: [omaha] Open source projects/Hack Omaha
> > > > Message-ID:
> > > >         <CAFXqcUrnfbEkTQWmgTxQv5b3eyJqF5=iXJqcXtu6vBeiwpF7Bg at mail.gmail.com>
> > > > Content-Type: text/plain; charset=ISO-8859-1
> > > >
> > > > Jeff Hinrichs and I had lunch last week and wanted to have this
> > > > discussion on list. So pardon the royal Wes and other whatnot.
> > > >
> > > > The basic gist of our conversation was finding ways Omaha Python and
> > > > Hack Omaha could play together. For those that don't know, Hack Omaha
> > > > is an annual event/competition, sponsored by the World-Herald, to make
> > > > cool civic projects. Generally, the projects start with civic
> > > > information, and aim to make that information more useful, more fun,
> > > > or just more accessible. (We're tentatively scheduled for Oct 18-20
> > > > this year, if it sounds intriguing).
> > > >
> > > > If I recall correctly, there were two ways we thought the groups could
> > > > fit together:
> > > >
> > > > - First, of course, folks from the group could compete in Hack Omaha.
> > > > I hope you do, but I understand giving up a weekend to code isn't
> > > > always a great attraction. (Though this one will be better than the
> > > > previous two... and I thought those were pretty fun).
> > > > - Second, members could adopt projects after the competition, and
> > > > either move them forward or convert them to Python. Case in point,
> > > > the winner from the last event:
> > > > https://github.com/mattdsteele/hackomaha-council-agendas
> > > > It pulls down nasty city council agenda PDFs from the city's website,
> > > > then indexes them and makes them text-searchable. It's awesome. It's
> > > > also in Perl.
> > > >
> > > > If a project is Pythonized, The World-Herald can host it on
> > > > dataomaha.com, where we'd be super happy to credit anyone who had
> > > > their hands in the app itself. So now we're working with the trifecta:
> > > > - Fun open source projects that build community
> > > > - And make local government data more useful
> > > > - And make killer entries on a resume
> > > >
> > > > Am I missing anything, Jeff? And non-Jeffs, all of this might be
> > > > bullshit. We also might be missing something key.
> > > >
> > > > Does any of this sound good? Is there anything you think we're
> > > > missing? Tell us what sucks!
> > > >
> > > > -Matt
> > > >
> > > >
> > > > ------------------------------
> > > >
> > > > Message: 2
> > > > Date: Mon, 15 Jul 2013 16:23:03 -0500
> > > > From: James Drake <jim.drake at gmail.com>
> > > > To: Omaha Python Users Group <omaha at python.org>
> > > > Subject: Re: [omaha] Open source projects/Hack Omaha
> > > > Message-ID:
> > > >         <CAK-qddObXb3gSOjULuxF_XLAkY37UMe2e6sGRhwZ+kLGeaKW+g at mail.gmail.com>
> > > > Content-Type: text/plain; charset=UTF-8
> > > >
> > > > Matt,
> > > >
> > > > I'd be interested in putting my meager talents to work on a project,
> > > > maybe sooner rather than later. Are there any left over that weren't
> > > > attempted? The timing for volunteering may just work and learning
> > > > something new may just work better for me. I'm more comfortable with
> > > > Perl now but would be happy to work at a slower pace in Python too.
> > > >
> > > > Jim
> > > >
> > > >
> > > > > _______________________________________________
> > > > > Omaha Python Users Group mailing list
> > > > > Omaha at python.org
> > > > > http://mail.python.org/mailman/listinfo/omaha
> > > > > http://www.OmahaPython.org
> > > > >
> > > >
> > > >
> > > > ------------------------------
> > > >
> > > > Message: 3
> > > > Date: Mon, 15 Jul 2013 16:27:30 -0500
> > > > From: Jay Hannah <jay at jays.net>
> > > > To: Omaha Python Users Group <omaha at python.org>
> > > > Subject: Re: [omaha] Open source projects/Hack Omaha
> > > > Message-ID: <7FB438AD-EEDD-44AA-A42C-F37001DAE37E at jays.net>
> > > > Content-Type: text/plain; charset=us-ascii
> > > >
> > > > On Jul 15, 2013, at 3:46 PM, Matt Wynn <matt.wynn at gmail.com> wrote:
> > > > > giving up a weekend to code isn't always a great attraction.
> > > >
> > > > It's not? I do it a lot.   ;)
> > > >
> > > > > - Second, members could adopt projects after the competition, and
> > > > > either move them forward or convert them to Python. Case in point,
> > > > > the winner from the last event:
> > > > > https://github.com/mattdsteele/hackomaha-council-agendas
> > > > > It pulls down nasty city council agenda PDFs from the city's
> > > > > website, then indexes them and makes them text-searchable. It's
> > > > > awesome. It's also in Perl.
> > > > >
> > > > > If a project is Pythonized, The World-Herald can host it on
> > > > > dataomaha.com,
> > > >
> > > > Why can't / doesn't dataomaha.com "host" stuff that's already done
> > > > that happens to be written in Perl? Just curious.
> > > >
> > > > Thanks,
> > > >
> > > > j
> > > > omacode.org
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > ------------------------------
> > > >
> > > > Subject: Digest Footer
> > > >
> > > > _______________________________________________
> > > > Omaha mailing list
> > > > Omaha at python.org
> > > > http://mail.python.org/mailman/listinfo/omaha
> > > >
> > > >
> > > > ------------------------------
> > > >
> > > > End of Omaha Digest, Vol 77, Issue 8
> > > > ************************************
> > > >
>
>
>
> --
> Mike Hostetler
> SquarePeg Systems
> http://www.squarepegsystems.com

