unicode .replace not working - why?

Mark Tolonen M8R-yfto6h at mailinator.com
Sun Oct 12 22:53:09 EDT 2008


Your original code has:

   textu.replace(unichr(167),'\n')

It should be, as Dennis suggested (but maybe you were distracted by his 'fn' 
replacement, so I'll leave it out):

   textu = textu.replace(unichr(167),'\n')

Python strings are immutable, so .replace does not modify the string in 
place.  It returns a new string with the replacements made, so you have to 
assign the result back to the name.
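
Here is a small sketch of the difference. The sample text is made up, and I 
wrote the character as the literal u'\xa7' (the same thing as unichr(167)) 
so the snippet also runs on Python 3:

```python
# Hypothetical sample text; u'\xa7' is the same character as unichr(167)
# on Python 2 (chr(167) on Python 3).
SECTION = u'\xa7'

textu = u'first line' + SECTION + u'second line'

# Calling .replace and throwing away the result leaves textu untouched.
textu.replace(SECTION, u'\n')
unchanged = SECTION in textu   # the section sign is still in textu

# Rebinding the name to the returned string is what makes the change stick.
textu = textu.replace(SECTION, u'\n')
lines = textu.split(u'\n')
```

After the second call, textu no longer contains the section sign and lines 
holds the two pieces of text.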

-Mark

"Kurt Peters" <nospampeterskurt at msn.com> wrote in message 
news:-OmdnXghhrxMN2_VnZ2dnUVZ_rHinZ2d at comcast.com...
> Thanks,
>  clearly though, my "for loop" shows a character whose ord() is 167, and 
> print repr(textu) shows the character \xa7 (as does Peter Otten's 
> post). So you can see what I see, here's the document I'm using - the 
> Special Use Airspace document at
> http://www.faa.gov/airports_airtraffic/air_traffic/publications/
> which is = JO 7400.8P (PDF)
>
> if you just look at page three, it shows those unusual characters.
> Once again, a "simple" replace doesn't seem to work.  I can't seem 
> to figure out how to get it to work, despite all the great posts 
> attempting to shed some light on the subject.
>
> Regards,
> Kurt
>
>
> "John Machin" <sjmachin at lexicon.net> wrote in message 
> news:42f39e4c-e49a-49a3-8a2c-1adbcbb81d88 at u40g2000pru.googlegroups.com...
> On Oct 12, 7:05 am, Kurt Peters <nospampete... at bigfoot.com> wrote:
>> I'm using the code below to read a pdf document, and it has no line feeds
>> or carriage returns in the imported text. I'm therefore trying to just
>> replace the symbol that looks like it would be an end of line (found by
>> examining the characters in the "for loop") unichr(167).
>> Unfortunately, the replace isn't working, does anyone know what I'm
>> doing wrong? I tried a number of things so I left comments in place as a
>> subset of the bunch of things I tried to no avail.
>
> This is the first time I've ever looked inside a PDF file, and *only*
> one file, but:
>
> import pyPdf, sys
> filename = sys.argv[1]
> doc = pyPdf.PdfFileReader(open(filename, "rb"))
> for pageno in range(doc.getNumPages()):
>    page = doc.getPage(pageno)
>    textu = page.extractText()
>    print "pageno", pageno
>    print type(textu)
>    print repr(textu)
>
> gives me <type 'unicode'> and text with lots of \n at places where
> you'd expect them.
>
> The only problem I can see is that where I see (and expect) quotation
> marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
> the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
> and apostrophes. I had a bit of a poke around:
>
> 1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
> \x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
> into \x93 and \x94).
>
> 2. Then pyPdf appears to push these through a fixed transformation
> table (_pdfDocEncoding in generic.py) and they become \ufb01 and
> \ufb02.
>
> 3. However:
> |>>> '\x93\x94'.decode('cp1252') # as suspected
> |u'\u201c\u201d' # as expected
> |>>>
>
> AFAICT there is only one reference to encoding in the pyPdf docs: "if
> pyPdf was unable to decode the string's text encoding" ...
>
> Cheers,
> John
> 
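
On John's point 3 in the quoted message: the same check can be written so 
that it also runs on Python 3, where byte strings need a b prefix. This is 
only a sketch of the decoding step. It does not call pyPdf, and the variable 
names are made up for illustration:

```python
# Bytes John reported in the FlateDecode output for the curly quotes.
raw = b'\x93\x94'

# cp1252 maps 0x93 and 0x94 to the left and right double quotation marks.
quotes = raw.decode('cp1252')

# What John's repr output showed coming back from extractText() instead.
seen = u'\ufb01\ufb02'

# The two differ, which matches the mismatch John describes.
mismatch = quotes != seen
```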



