[CentralOH] Parsing PDFs

Michael S. Yanovich yanovich.1 at osu.edu
Tue Oct 5 08:51:33 CEST 2010


I have a very large PDF (roughly 500 pages!) that contains one long table
of information. I'd like to parse the PDF and compute some statistics from
it, such as how many times something in a row occurs throughout the
document, and more.

I've looked into PDFMiner, which is a great tool. However, I don't just
want to dump the PDF to plain text, HTML, or XML. The HTML and XML output
is very ugly for this PDF, and the plain text seems manageable, but it
would be very time-consuming to write the code I want on top of it.

The way PDFMiner organizes my PDF into plain text is that it lists the
values for each column and then moves on to the next page. So I could in
theory, hoping everything matches up, go through and assume that the
first value for column A will always match the first value for column B.
But this could get tricky towards the end, since the last page is only
half filled.
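
To make the pairing idea concrete, here is a minimal sketch in Python. It
assumes the plain-text output has already been split into one list of values
per column for each page; that splitting step is not shown. The names
rows_from_page and rows_from_pages and the sample data are made up for
illustration and are not part of PDFMiner. It refuses to pair columns whose
lengths differ, which is the case I'm worried about.

```python
from collections import Counter


def rows_from_page(columns):
    """Pair the i-th value of every column on one page into a row tuple."""
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        # Columns do not line up, so positional pairing would be a guess.
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))


def rows_from_pages(pages):
    """Rebuild rows page by page; a short last page is fine if it is consistent."""
    rows = []
    for page_number, columns in enumerate(pages, start=1):
        try:
            rows.extend(rows_from_page(columns))
        except ValueError as err:
            raise ValueError(f"page {page_number}: {err}") from err
    return rows


# Hypothetical sample: two pages, two columns each, second page half filled.
pages = [
    [["apple", "pear", "apple"], ["1", "2", "3"]],
    [["pear"], ["4"]],
]

rows = rows_from_pages(pages)

# Count how often each value appears in the first column across all pages.
counts = Counter(row[0] for row in rows)
```

With rows rebuilt like this, counting things is easy with
collections.Counter. The hard part remains getting reliable per-column
lists out of the text in the first place.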

I'm basically wondering if there exists something *like* BeautifulSoup
for PDFs. I am looking for something that can take a PDF and create a
Pythonic object, so that I can go through each page, break its elements
down further, and examine them, preferably in a more user-friendly way
than PDFMiner.

Any ideas?

Michael S. Yanovich
