PDF Parser?

John Hunter jdhunter at ace.bsd.uchicago.edu
Mon Jul 7 12:12:51 EDT 2003


>>>>> "Miki" == Miki Tebeka <tebeka at cs.bgu.ac.il> writes:

    Miki> Hello All, I'm looking for a PDF parser.  Any pointers?

A little more info would be helpful: do you need access to all the PDF
structures or just the text?  AFAIK, there is no full PDF parser in
Python.  The subject has come up several times before, so check the
Google Groups archives:

  http://groups.google.com/groups?q=pdf+parser+group%3A*python*&ie=UTF-8&oe=UTF-8&hl=en&btnG=Google+Search

Things people have suggested before: 

  1) use pdftotext and parse the text
  2) wrap xpdf's parser.

For example, if you have pdftotext, the following will give you a
Python file-like handle to the extracted text:

import os

def pdf2txt(fname):
    return os.popen('pdftotext -raw -ascii7 %s -' % fname)
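
A slightly fuller sketch of option 1 (run pdftotext, then parse the
text) might look like the following.  It is only an illustration: the
helper names pdftotext_cmd, pdf2lines and grep_lines are made up here,
it assumes a pdftotext binary on your PATH that accepts -raw and "-"
for stdout, and it guesses that the output can be decoded as UTF-8.

```python
import shutil
import subprocess

def pdftotext_cmd(fname):
    """Argument list for pdftotext; '-' as the output file means stdout."""
    # Passing a list (no shell) means filenames with spaces need no quoting.
    return ['pdftotext', '-raw', fname, '-']

def pdf2lines(fname):
    """Run pdftotext on fname and return the extracted text as a list of lines."""
    if shutil.which('pdftotext') is None:
        raise RuntimeError('pdftotext not found on PATH')
    proc = subprocess.run(pdftotext_cmd(fname), capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return proc.stdout.decode('utf-8', errors='replace').splitlines()

def grep_lines(lines, word):
    """Return (line_number, line) pairs that contain word, ignoring case."""
    word = word.lower()
    return [(n, line) for n, line in enumerate(lines, 1) if word in line.lower()]
```

Something like grep_lines(pdf2lines('paper.pdf'), 'python') would then
list the matching lines; 'paper.pdf' is just a placeholder name.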

If you just want to search and index PDF files, see
http://pdfsearch.sourceforge.net.

John Hunter




