unicode .replace not working - why?

Sat Oct 11 19:43:07 EDT 2008

On Oct 12, 7:05 am, Kurt Peters <nospampete... at bigfoot.com> wrote:
> I'm using the code below to read a pdf document, and it has no line feeds
> or carriage returns in the imported text.  I'm therefore trying to just
> replace the symbol that looks like it would be an end of line (found by
> examining the characters in the "for loop") unichr(167).
>   Unfortunately, the replace isn't working, does anyone know what I'm
> doing wrong?  I tried a number of things so I left comments in place as a
> subset of the bunch of things I tried to no avail.

This is the first time I've ever looked inside a PDF file, and I've
tried *only* one file, but:

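import pyPdf, sys
filename = sys.argv[1]  # path to the PDF, given on the command line
doc = pyPdf.PdfFileReader(open(filename, "rb"))  # open in binary mode
for pageno in range(doc.getNumPages()):
    page = doc.getPage(pageno)
    textu = page.extractText()
    print "pageno", pageno
    print type(textu)  # is the extracted text str or unicode?
    print repr(textu)  # repr shows non-ASCII and control chars as escapes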

gives me <type 'unicode'> and text with lots of \n at places where
you'd expect them.

The only problem I can see is that where I see (and expect) quotation
marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
the repr is showing \ufb01 and \ufb02, which are the "fi" and "fl"
ligatures. Similar problems with em-dashes and apostrophes. I had a
bit of a poke around:

1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
\x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
into \x93 and \x94).

2. Then pyPdf appears to push these through a fixed transformation
table (_pdfDocEncoding in generic.py) and they become \ufb01 and
\ufb02.

3. However:
|>>> '\x93\x94'.decode('cp1252') # as suspected
|u'\u201c\u201d' # as expected
|>>>
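
For anyone following along in Python 3, here is a minimal sketch of the
comparison in points 1 to 3. The two-entry dict is only a stand-in for
the table mentioned in point 2 (an assumption for illustration, not a
copy of pyPdf's _pdfDocEncoding), and describe() is a hypothetical
helper, not part of any library.

```python
import unicodedata

# Hypothetical two-entry excerpt of a PDFDocEncoding-style table, limited
# to the two byte values discussed above; NOT copied from pyPdf.
PDFDOC_EXCERPT = {0x93: "\ufb01", 0x94: "\ufb02"}

raw = b"\x93\x94"

# The octal escapes \223 and \224 name the same two byte values.
assert [int("223", 8), int("224", 8)] == list(raw)

# The same bytes read two ways: as cp1252, and through the table excerpt.
as_cp1252 = raw.decode("cp1252")
as_pdfdoc = "".join(PDFDOC_EXCERPT[b] for b in raw)


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


for label, text in (("cp1252", as_cp1252), ("table", as_pdfdoc)):
    for code, name in describe(text):
        print(label, code, name)
```

If the bytes in the content stream really are cp1252, the first reading
gives the curly quotes that Acrobat Reader displays, while the table
lookup gives the ligatures seen in the repr.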

AFAICT there is only one reference to encoding in the pyPdf docs: "if
pyPdf was unable to decode the string's text encoding" ...

Cheers,
John