convert .pdf files to .txt files

Davor syl_stand at yahoo.es
Sat Jun 10 09:19:16 EDT 2006


Hi, my name is David.
I need to read information from .pdf files and convert it to .txt
files, and I have to do this in Python.
I have been looking for Python libraries, and pdftools seems to be
the solution, but I do not know how to use it well.
This is the example that I found on the internet:


from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text (contents):
   for item in contents:
     if isinstance (item, type ([])):
       for i in contents_to_text (item):
         yield i
     elif isinstance (item, Text):
       yield item.text

doc = PDFDocument ("/home/dave/pruebas_ficheros/carlos.pdf")
n_pages = doc.count_pages ()
text = []

for n_page in range (1, (n_pages+1)):
   print "Page", n_page
   page = doc.read_page (n_page)
   contents = page.read_contents ().contents
   text.extend (contents_to_text (contents))

print "".join (text)

The problem is that on some PDFs it joins words together, and in
Spanish the accented letters ("acentos") come out as strange
characters. For example:
"camión"            ----->     "cami/86n"
"IMPLEMENTACIÓN"    ----->     "IMPLEMENTACI?"
If someone knows how to use pdftools and can help me, I would be
very happy.

Another thing is that I can see the text read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
information in it.
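
For the file-saving part, a minimal sketch using only the Python
standard library might look like the following. It is an assumption
about what is wanted, not something taken from pdftools: the helper
name save_text, the output file name salida.txt, and the sample text
list are all hypothetical, and the sketch presumes the extracted
pieces are already correctly decoded Unicode strings.

```python
import io
import os
import tempfile


def save_text(chunks, path, encoding="utf-8"):
    """Join the text chunks and write them to the file at `path`."""
    # Opening with an explicit encoding lets accented letters such as
    # "ó" be written without depending on the platform default.
    with io.open(path, "w", encoding=encoding) as out:
        out.write(u"".join(chunks))


# Hypothetical stand-in for the `text` list built by the page loop.
text = [u"cami\u00f3n", u"\n", u"IMPLEMENTACI\u00d3N", u"\n"]

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "salida.txt")
save_text(text, out_path)
```

Writing with an explicit encoding does not repair characters that
were already extracted wrongly from the PDF; it only ensures that
correct text is stored correctly.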


Sorry for my English.
Thanks for everything.



