convert .pdf files to .txt files

David Boddie david at boddie.org.uk
Sat Jun 10 17:35:00 EDT 2006


Davor wrote:
> Hi, my name is David.
> I need to read information from .pdf files and convert it to .txt files,
> and I have to do this in Python.
> I have been looking for Python libraries, and pdftools seems to
> be the solution, but I do not know how to use it well.
> This is the example that I found on the internet:

[...]

> for n_page in range (1, (n_pages+1)):
>    print "Page", n_page
>    page = doc.read_page (n_page)
>    contents = page.read_contents ().contents
>    text.extend (contents_to_text (contents))
>
> print "".join (text)
>
> The problem is that for some PDFs it joins words together, and in
> Spanish the accents ("acentos") come out wrong. For example,
> "camión" becomes cami/86n, and "IMPLEMENTACIÓN" becomes
> "IMPLEMENTACI?", with strange characters.

pdftools just extracts the textual data in the file and stores it in
Text instances - it doesn't try to interpret or decode the text. I'd
like to fix the library so that it does try and decode the text
properly and put it into unicode strings, but I don't have the time
right now.

Remember that text can be stored in PDF files in many different
ways, and that the text cannot always be extracted in its original
form.
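
As a rough illustration of what the decoding step might look like, here
is a minimal sketch in Python 3. It is not part of pdftools:
decode_pdf_literal is a hypothetical helper, and it assumes the raw
bytes come from a PDF literal string whose font uses an encoding close
to Windows-1252. Fonts with custom encodings would need more work than
this, and backslash line continuations are not handled.

```python
import re

# Hypothetical helper, not part of pdftools.
# Single-character escapes allowed inside a PDF literal string.
_ESCAPES = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
}

# A backslash followed by 1-3 octal digits or by one escape character.
_PATTERN = re.compile(rb"\\([0-7]{1,3}|[nrtbf()\\])")


def decode_pdf_literal(raw, encoding="cp1252"):
    """Resolve backslash escapes in raw bytes, then decode to str.

    The default encoding is an assumption; the right choice depends
    on the font used in the PDF file.
    """
    def replace(match):
        token = match.group(1)
        if token in _ESCAPES:
            return _ESCAPES[token]
        # Octal escape: keep only the low byte of the value.
        return bytes([int(token, 8) & 0xFF])

    data = _PATTERN.sub(replace, raw)
    # Bytes that are undefined in the encoding become U+FFFD.
    return data.decode(encoding, "replace")
```

For example, decode_pdf_literal(rb"cami\363n") gives "camión" under
that assumption, because octal 363 is the byte for "ó" in Windows-1252.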

> If someone knows how to use pdftools and can help me, I would be
> very happy.
>
> Another thing is that I can see the letters read from the .pdf on the
> screen, but I do not know how to create a .txt file and save this
> information in it.

You need to do something like this:

with open("myfilename", "w") as f:
    f.write("".join(text))
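
If the text has already been decoded to unicode strings, it may be
safer to name the output encoding explicitly so that accented
characters survive the trip to disk. The following is only a sketch:
save_text is a hypothetical helper, and UTF-8 is an assumed choice of
encoding rather than anything pdftools requires.

```python
import io


def save_text(chunks, filename, encoding="utf-8"):
    """Join decoded text chunks and write them to filename.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    # io.open accepts an encoding argument and encodes on write.
    with io.open(filename, "w", encoding=encoding) as f:
        f.write(u"".join(chunks))
```

It could be called as save_text(text, "myfilename.txt"), where text is
a list of unicode strings.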

> Sorry for my english.

Don't worry about it. It's much better than my Spanish will ever be.

Sorry I couldn't give you more help with this. You may find that the
other tools mentioned by people in this thread will do what you
need better than pdftools can at the moment.

David



