Errors with PyPdf

Sun Sep 26 22:19:40 EDT 2010

On Sep 27, 12:08 pm, flebber <flebber.c... at gmail.com> wrote:
> On Sep 27, 10:39 am, flebber <flebber.c... at gmail.com> wrote:
>
>
>
> > On Sep 27, 9:38 am, "w.g.sned... at gmail.com" <w.g.sned... at gmail.com>
> > wrote:
>
> > > On Sep 26, 7:10 pm, flebber <flebber.c... at gmail.com> wrote:
>
> > > > I was trying to use pyPdf following a recipe from the ActiveState
> > > > cookbooks. However, I cannot get it to work. I am unsure if it is me or
> > > > because sets are deprecated.
>
> > > > I have placed a PDF in my C:\ drive. It is called
> > > > "Components-of-Dot-NET.pdf". You could use anything; I was just
> > > > testing with it.
>
> > > > I was using the last script on that page that was most recently
> > > > updated. I am using python 2.6.
>
> > > > http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-co...
>
> > > > import pyPdf
>
> > > > def getPDFContent(path):
> > > >     content = "C:\Components-of-Dot-NET.pdf"
> > > >     # Load PDF into pyPDF
> > > >     pdf = pyPdf.PdfFileReader(file(path, "rb"))
> > > >     # Iterate pages
> > > >     for i in range(0, pdf.getNumPages()):
> > > >         # Extract text from page and add to content
> > > >         content += pdf.getPage(i).extractText() + "\n"
> > > >     # Collapse whitespace
> > > >     content = " ".join(content.replace(u"\xa0", " ").strip().split())
> > > >     return content
>
> > > > print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> > > > "ignore")
>
> > > > This is my error.
>
> > > > Warning (from warnings module):
> > > >   File "C:\Documents and Settings\Family\Application Data\Python
> > > > \Python26\site-packages\pyPdf\pdf.py", line 52
> > > >     from sets import ImmutableSet
> > > > DeprecationWarning: the sets module is deprecated
>
> > > > Traceback (most recent call last):
> > > >   File "C:/Python26/Pdfread", line 15, in <module>
> > > >     print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> > > > "ignore")
> > > >   File "C:/Python26/Pdfread", line 6, in getPDFContent
> > > >     pdf = pyPdf.PdfFileReader(file(path, "rb"))
>
> > > ---> IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
>
> > > Looks like an issue with finding the file.
> > > How do you pass the path?
>
> > Okay, thanks. I thought that when I set content here
>
> > def getPDFContent(path):
> >     content = "C:\Components-of-Dot-NET.pdf"
>
> > that I was defining where the file is.
>
> > But yeah, I updated the script to the one below and it works. That is,
> > the contents are displayed in the interpreter. How do I output to a
> > .txt file?
>
> > import pyPdf
>
> > def getPDFContent(path):
> >     content = "C:\Components-of-Dot-NET.pdf"
> >     # Load PDF into pyPDF
> >     pdf = pyPdf.PdfFileReader(file(path, "rb"))
> >     # Iterate pages
> >     for i in range(0, pdf.getNumPages()):
> >         # Extract text from page and add to content
> >         content += pdf.getPage(i).extractText() + "\n"
> >     # Collapse whitespace
> >     content = " ".join(content.replace(u"\xa0", " ").strip().split())
> >     return content
>
> > print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii",
> > "ignore")
>
> I have found far more advanced scripts while searching around, but I will
> have to keep trying, as I cannot get an output file or specify the path.
>
> Edit: very strangely, whilst searching for examples I found my own post,
> just written here, ranking number 5 on Google within 2 hours. Bizarre.
>
> http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf...
>
> It replicates our thread as theirs. I was searching Google with "pypdf
> return to txt file".

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 16, in <module>
    open('x.txt', 'w').write(content)
NameError: name 'content' is not defined
>>>

I get the traceback above when I use:

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.txt"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii",
"ignore")
open('x.txt', 'w').write(content)
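
The NameError most likely comes from scoping: content is a local name inside getPDFContent, so it does not exist at module level where the last line tries to use it. The value the function returns has to be bound to a name first. Note also that content starts out as the path string, so that string ends up at the front of the extracted text. Below is a minimal sketch of one way to do this, not a tested fix against this particular PDF. The helper names collapse_whitespace, get_pdf_content and save_text are made up for illustration, and the pyPdf calls are only the ones already used in the script above.

```python
def collapse_whitespace(text):
    # Turn non-breaking spaces into plain spaces, then squeeze every
    # run of whitespace down to a single space.
    return " ".join(text.replace(u"\xa0", " ").split())


def get_pdf_content(path):
    # Imported here so the two helpers can be used without pyPdf installed.
    import pyPdf

    content = u""  # start empty, not with the file path
    with open(path, "rb") as handle:
        pdf = pyPdf.PdfFileReader(handle)
        # Extract inside the with block, while the file is still open.
        for i in range(pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"
    return collapse_whitespace(content)


def save_text(text, out_path):
    # Drop characters that ASCII cannot represent, as the print line did,
    # and write the resulting bytes to the output file.
    data = text.encode("ascii", "ignore")
    with open(out_path, "wb") as out:
        out.write(data)


# Usage, with the path and output name from the script above:
# text = get_pdf_content(r"C:\Components-of-Dot-NET.pdf")
# save_text(text, "x.txt")
```

The key change is that the return value is stored in text before it is written, instead of referring to the function's local content from outside the function.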