[omaha] Omaha Digest, Vol 77, Issue 8

Jeff Hinrichs - DM&T jeffh at dundeemt.com
Wed Jul 17 22:42:58 CEST 2013


The public source of the data appears to be PDFs.
http://www.cityofomaha.org/cityclerk/city-council/agendas

-j


On Wed, Jul 17, 2013 at 3:23 PM, Dan Linder <dan at linder.org> wrote:

> Did I misread, or aren't the contents of these PDFs coming from a template
> or text processing engine originally?  Why not generate the HTML and
> searchable contents from there and build the PDF after the fact?  Or am I
> missing the source of the data?
>
> Dan
>
> FWIW, if you need a command-line tool to convert a URL to PDF, check out
> WKHtmlToPDF (http://code.google.com/p/wkhtmltopdf/).  It has binaries for
> Linux, OSX, and Windows.
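
A minimal sketch of driving that tool from Python, assuming the `wkhtmltopdf` binary is installed and on PATH. The helper names and the output filename are hypothetical; the only CLI usage relied on is the basic `wkhtmltopdf <input> <output>` form.

```python
import shutil
import subprocess


def build_wkhtmltopdf_command(url, output_path, binary="wkhtmltopdf"):
    """Return the argv list for the basic `wkhtmltopdf <input> <output>` form."""
    if not url or not output_path:
        raise ValueError("both url and output_path are required")
    return [binary, url, output_path]


def url_to_pdf(url, output_path):
    """Run wkhtmltopdf; raises if the binary is missing or exits non-zero."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf not found on PATH")
    subprocess.run(build_wkhtmltopdf_command(url, output_path), check=True)
```

Something like `url_to_pdf("http://example.com", "example.pdf")` would then write the PDF next to the script.
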
>
>
> On Wed, Jul 17, 2013 at 12:03 PM, Wes Turner <wes.turner at gmail.com> wrote:
>
> > Conveniently just found this IPython notebook demonstrating OCR in Python:
> >
> > http://nbviewer.ipython.org/url/ocropy.ocropus.googlecode.com/hg/Notebooks/ocropus-steps.ipynb
> >
> > On Jul 17, 2013 11:44 AM, "Wes Turner" <wes.turner at gmail.com> wrote:
> >
> > > I remember reading about what the major search engines use for
> > > PDF-to-HTML and OCR for bitmap PDFs but can't find the link. Many of the
> > > desktop search programs do have some sort of PDF support.
> > >
> > > On Jul 17, 2013 9:00 AM, "Mike Hostetler" <mike at squarepegsystems.com>
> > > wrote:
> > > >
> > > > PDF conversion to anything is hard. Turning it into anything structured
> > > > like HTML or any markup language would be fraught with peril. If you are
> > > > lucky, you can dump them to straight, plain text. If you can do that,
> > > > leave it alone.
> > > >
> > > > Matt Steele came to me for advice on how to parse the PDFs (I had tweeted
> > > > that I was doing it at work so he contacted me off-line) and what they
> > > > used matches fairly closely with what I do. The kicker is that sometimes
> > > > people simply scan a page and save that image in a PDF. It looks like
> > > > text, but it's actually an image. So you can't use normal PDF parsing
> > > > tools with that. The best you can do is extract each image and run an
> > > > OCR tool on them (which is what tesseract is).
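
The "plain text if you are lucky, OCR if you are not" approach described above could be sketched roughly like this. It assumes the poppler command-line tools (`pdftotext`, `pdfimages`) and `tesseract` are installed; the 20-character threshold and the function names are hypothetical choices, not anything from the Hack Omaha project.

```python
import subprocess
import tempfile
from pathlib import Path


def looks_scanned(extracted_text, min_chars=20):
    """Heuristic: a page-image PDF yields (almost) no extractable text."""
    return len("".join(extracted_text.split())) < min_chars


def pdf_to_text(pdf_path):
    """Try plain text extraction first; fall back to OCR on embedded images."""
    text = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        check=True, capture_output=True, text=True,
    ).stdout
    if not looks_scanned(text):
        return text

    with tempfile.TemporaryDirectory() as tmp:
        # Extract each embedded image as PNG: <tmp>/page-000.png, ...
        subprocess.run(
            ["pdfimages", "-png", str(pdf_path), str(Path(tmp) / "page")],
            check=True,
        )
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png")):
            # "stdout" as the output base makes tesseract print the text
            ocr = subprocess.run(
                ["tesseract", str(image), "stdout"],
                check=True, capture_output=True, text=True,
            )
            pages.append(ocr.stdout)
        return "\n".join(pages)
```

The threshold check is deliberately crude; a PDF that mixes real text with scanned pages would need a per-page decision instead.
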
> > > >
> > > > Throwing the result into ElasticSearch is certainly the best option --
> > > > it's magical how well it works. Just pump the text files into an ES
> > > > server via its REST interface and, voila, you can google your documents.
> > > >
> > > > http://www.elasticsearch.org/overview/
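
As a rough illustration of pumping text into ES over REST, here is a sketch using only the standard library. The server address `http://localhost:9200`, the index name `agendas`, and the `text` field are assumptions for the example; it targets the `_bulk` endpoint as found in recent Elasticsearch releases, and older releases may also expect a `_type` in each action line.

```python
import json
import urllib.request


def build_bulk_body(docs, index="agendas"):
    """Serialize {doc_id: text} into the newline-delimited bulk format."""
    lines = []
    for doc_id, text in docs.items():
        # One action line, then one source line, per document
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"text": text}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def post_bulk(body, base_url="http://localhost:9200"):
    """POST a prepared bulk body and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the body builder separate from the HTTP call makes it easy to inspect what would be sent before pointing it at a real server.
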
> > > >
> > >
> > > Faceted query support like that provided by ElasticSearch (Lucene) would
> > > be helpful.
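
For a sense of what a faceted query might look like, here is a sketch that only builds the JSON search body. The `text` and `year` field names are hypothetical, and it uses the `aggs` (terms aggregation) syntax of later Elasticsearch releases; releases current at the time of this thread used a separate `facets` syntax instead.

```python
import json


def build_faceted_query(search_text, facet_field="year", page_size=10):
    """Full-text match plus a count of hits per distinct facet_field value."""
    return {
        "size": page_size,
        "query": {"match": {"text": search_text}},
        "aggs": {
            "by_" + facet_field: {"terms": {"field": facet_field}},
        },
    }


if __name__ == "__main__":
    # Print the body that would be sent to an index's _search endpoint
    print(json.dumps(build_faceted_query("liquor"), indent=2))
```
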
> > >
> > > I was just looking into a Kibana3 query dashboard for ElasticSearch (for
> > > Heka). I am somewhat familiar with PyES, but Kibana3 is all JavaScript
> > > (like the Deniz SPARQL browser).
> > >
> > > Not sure how/whether Kibana3 supports any sort of throttling. (E.g. how
> > > resilient is an un-reverse-proxied ES?)
> > >
> > > * https://tika.apache.org/1.4/formats.html#Portable_Document_Format
> > >
> > > * https://github.com/elasticsearch/elasticsearch-mapper-attachments/blob/master/README.md
> > >
> > > While convenient for archiving, most PDF serializations of editor document
> > > trees end up mixing way too much presentation with the content.
> > >
> > > The PDF.js HTML5 PDF viewer is also great.
> > >
> > > >
> > > >
> > > >
> > > > On Wed, Jul 17, 2013 at 1:07 AM, Wes Turner <wes.turner at gmail.com> wrote:
> > > >
> > > > > So the data/publishing process would be something like:
> > > > >
> > > > >     Template -> Text Processor -> PDF -> SQL/search -> Template -> HTML
> > > > >
> > > > > ?
> > > > >
> > > > > I definitely see the value in getting past agenda PDFs indexed and
> > > > > searchable, but wouldn't it make more sense to just save as HTML? (Or
> > > > > .RST :o)
> > > > >
> > > > >     Template -> Text Processor -> HTML (-> ___ custom search)
> > > > >
> > > > > That way, we could read these documents on a device with a smaller
> > > > > screen. And permalink to headers. And adjust font sizes. And search.
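
To make the "straight to HTML with permalinked headers" idea concrete, here is a small sketch using only the standard library. The agenda structure (a title plus heading/body pairs) and the function names are made up for illustration; nothing here reflects how the city actually produces its agendas.

```python
import html
import re


def slugify(title):
    """Lowercase, keep alphanumerics, join the remaining words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def render_agenda(title, items):
    """Render (heading, body) pairs as HTML whose headers carry anchor ids."""
    parts = ["<h1>{}</h1>".format(html.escape(title))]
    for heading, body in items:
        anchor = slugify(heading)
        # The id makes the header linkable as page.html#<anchor>
        parts.append(
            '<h2 id="{0}"><a href="#{0}">{1}</a></h2>'.format(
                anchor, html.escape(heading)
            )
        )
        parts.append("<p>{}</p>".format(html.escape(body)))
    return "\n".join(parts)
```

With ids on the headers, a link such as `agenda.html#liquor-licenses` would jump straight to that section, and the plain markup reflows on small screens.
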
> > > > >
> > > > > .. see also: http://redd.it/140oxj
> > > > > On Jul 16, 2013 9:15 AM, "Matt Wynn" <matt.wynn at gmail.com> wrote:
> > > > >
> > > > > > Jay, you're my people!
> > > > > >
> > > > > > And Jim, I think Jeff and I had talked about council agendas as a fun
> > > > > > project to toy with. I've heard from folks at the city, for instance,
> > > > > > that they'd love to have just the underlying functionality available.
> > > > > > That way they could have a centralized source for finding or
> > > > > > displaying zoning/planning/liquor/whatever issues. To separate the
> > > > > > business logic of that project from the site itself would have value
> > > > > > -- I'd host it in a heartbeat.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Tue, Jul 16, 2013 at 5:00 AM, <omaha-request at python.org> wrote:
> > > > > >
> > > > > > > Today's Topics:
> > > > > > >
> > > > > > >    1. Open source projects/Hack Omaha (Matt Wynn)
> > > > > > >    2. Re: Open source projects/Hack Omaha (James Drake)
> > > > > > >    3. Re: Open source projects/Hack Omaha (Jay Hannah)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > ----------------------------------------------------------------------
> > > > > > >
> > > > > > > Message: 1
> > > > > > > Date: Mon, 15 Jul 2013 15:46:53 -0500
> > > > > > > From: Matt Wynn <matt.wynn at gmail.com>
> > > > > > > To: omaha at python.org
> > > > > > > Subject: [omaha] Open source projects/Hack Omaha
> > > > > > > > Message-ID: <CAFXqcUrnfbEkTQWmgTxQv5b3eyJqF5=iXJqcXtu6vBeiwpF7Bg at mail.gmail.com>
> > > > > > > Content-Type: text/plain; charset=ISO-8859-1
> > > > > > >
> > > > > > > > Jeff Hinrichs and I had lunch last week and wanted to have this
> > > > > > > > discussion on list. So pardon the royal Wes and other whatnot.
> > > > > > > >
> > > > > > > > The basic gist of our conversation was finding ways Omaha Python and
> > > > > > > > Hack Omaha could play together. For those that don't know, Hack Omaha
> > > > > > > > is an annual event/competition, sponsored by the World-Herald, to make
> > > > > > > > cool civic projects. Generally, the projects start with civic
> > > > > > > > information, and aim to make that information more useful, more fun,
> > > > > > > > or just more accessible. (We're tentatively scheduled for Oct 18-20
> > > > > > > > this year, if it sounds intriguing).
> > > > > > > >
> > > > > > > > If I recall correctly, there were two ways we thought the groups could
> > > > > > > > fit together:
> > > > > > > >
> > > > > > > > - First, of course, folks from the group could compete in Hack Omaha.
> > > > > > > > I hope you do, but I understand giving up a weekend to code isn't
> > > > > > > > always a great attraction. (Though this one will be better than the
> > > > > > > > previous two... and I thought those were pretty fun).
> > > > > > > > - Second, members could adopt projects after the competition, and
> > > > > > > > either move them forward or convert them to Python. Case in point, the
> > > > > > > > winner from the last event:
> > > > > > > > https://github.com/mattdsteele/hackomaha-council-agendas .
> > > > > > > > It pulls down nasty city council agenda PDFs from the city's website,
> > > > > > > > then indexes them and makes them text-searchable. It's awesome. It's
> > > > > > > > also in Perl.
> > > > > > > >
> > > > > > > > If a project is Pythonized, The World-Herald can host it on
> > > > > > > > dataomaha.com, where we'd be super happy to credit anyone who had
> > > > > > > > their hands in the app itself. So now we're working with the trifecta:
> > > > > > > > - Fun open source projects that build community
> > > > > > > > - And make local government data more useful
> > > > > > > > - And make killer entries on a resume
> > > > > > > >
> > > > > > > > Am I missing anything, Jeff? And non-Jeffs, all of this might be
> > > > > > > > bullshit. We also might be missing something key.
> > > > > > > >
> > > > > > > > Does any of this sound good? Is there anything you think we're
> > > > > > > > missing? Tell us what sucks!
> > > > > > >
> > > > > > > -Matt
> > > > > > >
> > > > > > >
> > > > > > > ------------------------------
> > > > > > >
> > > > > > > Message: 2
> > > > > > > Date: Mon, 15 Jul 2013 16:23:03 -0500
> > > > > > > From: James Drake <jim.drake at gmail.com>
> > > > > > > To: Omaha Python Users Group <omaha at python.org>
> > > > > > > Subject: Re: [omaha] Open source projects/Hack Omaha
> > > > > > > > Message-ID: <CAK-qddObXb3gSOjULuxF_XLAkY37UMe2e6sGRhwZ+kLGeaKW+g at mail.gmail.com>
> > > > > > > Content-Type: text/plain; charset=UTF-8
> > > > > > >
> > > > > > > Matt,
> > > > > > >
> > > > > > > > I'd be interested in putting my meager talents to work on a project,
> > > > > > > > maybe sooner rather than later. Are there any left over that weren't
> > > > > > > > attempted? The timing for volunteering may just work and learning
> > > > > > > > something new may just work better for me. I'm more comfortable with
> > > > > > > > Perl now but would be happy to work at a slower pace in Python too.
> > > > > > >
> > > > > > > Jim
> > > > > > >
> > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > Omaha Python Users Group mailing list
> > > > > > > > Omaha at python.org
> > > > > > > > http://mail.python.org/mailman/listinfo/omaha
> > > > > > > > http://www.OmahaPython.org
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ------------------------------
> > > > > > >
> > > > > > > Message: 3
> > > > > > > Date: Mon, 15 Jul 2013 16:27:30 -0500
> > > > > > > From: Jay Hannah <jay at jays.net>
> > > > > > > To: Omaha Python Users Group <omaha at python.org>
> > > > > > > Subject: Re: [omaha] Open source projects/Hack Omaha
> > > > > > > Message-ID: <7FB438AD-EEDD-44AA-A42C-F37001DAE37E at jays.net>
> > > > > > > Content-Type: text/plain; charset=us-ascii
> > > > > > >
> > > > > > > > On Jul 15, 2013, at 3:46 PM, Matt Wynn <matt.wynn at gmail.com> wrote:
> > > > > > > > giving up a weekend to code isn't always a great attraction.
> > > > > > >
> > > > > > > It's not? I do it a lot.   ;)
> > > > > > >
> > > > > > > > > - Second, members could adopt projects after the competition, and
> > > > > > > > > either move them forward or convert them to Python. Case in point,
> > > > > > > > > the winner from the last event:
> > > > > > > > > https://github.com/mattdsteele/hackomaha-council-agendas .
> > > > > > > > > It pulls down nasty city council agenda PDFs from the city's
> > > > > > > > > website, then indexes them and makes them text-searchable. It's
> > > > > > > > > awesome. It's also in Perl.
> > > > > > > > >
> > > > > > > > > If a project is Pythonized, The World-Herald can host it on
> > > > > > > > > dataomaha.com,
> > > > > > >
> > > > > > > > Why can't / doesn't dataomaha.com "host" stuff that's already done
> > > > > > > > that happens to be written in Perl? Just curious.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > j
> > > > > > > omacode.org
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ------------------------------
> > > > > > >
> > > > > > > Subject: Digest Footer
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Omaha mailing list
> > > > > > > Omaha at python.org
> > > > > > > http://mail.python.org/mailman/listinfo/omaha
> > > > > > >
> > > > > > >
> > > > > > > ------------------------------
> > > > > > >
> > > > > > > End of Omaha Digest, Vol 77, Issue 8
> > > > > > > ************************************
> > > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Mike Hostetler
> > > > SquarePeg Systems
> > > > http://www.squarepegsystems.com
> >
>
>
>
> --
> ***************** ************* *********** ******* ***** *** **
> "Quis custodiet ipsos custodes?"
>     (Who can watch the watchmen?)
>     -- from the Satires of Juvenal
> "I do not fear computers, I fear the lack of them."
>     -- Isaac Asimov (Author)
> ** *** ***** ******* *********** ************* *****************
>



-- 
Best,

Jeff Hinrichs
402.218.1473

