Help with character encodings

Tue May 20 11:50:17 EDT 2008

On Tue, 2008-05-20 at 08:28 -0700, Gary Herron wrote:
> A_H wrote:
> > Help!
> >
> > I've scraped a PDF file for text and all the minus signs come back as
> > u'\xad'.
> >
> > Is there any easy way I can change them all to plain old ASCII '-' ???
> >
> > str.replace complained about a missing codec.
> >
> >
> >
> > Hints?
> >
> 
> Encoding it into a 'latin1' encoded string seems to work:
> 
>   >>> print u'\xad'.encode('latin1')
>   -
> 
> 
Here's what I've found:

>>> x = u'\xad'
>>> x.replace('\xad','-')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 0: ordinal not in range(128)
>>> x.replace(u'\xad','-')
u'-'

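If you replace the *string* '\xad' in the first argument to replace with
the *unicode object* u'\xad', Python won't complain anymore.  The error
comes from mixing types: x is a unicode object, so Python first tries to
decode the plain string '\xad' with the default 'ascii' codec, and byte
0xad is outside the ASCII range.  (Mind you, you weren't using
str.replace.  You were using unicode.replace.  Slight difference, but
important.)  If you do the replace on a plain string instead, nothing has
to be converted, so you don't get a UnicodeDecodeError: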

>>> x = x.encode('latin1')
>>> x
'\xad'
>>> # Note the lack of a u before the ' above.
>>> x.replace('\xad','-')
'-'
>>> 
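
Here is a rough sketch that pulls the two approaches above together.  It
is only an illustration: the helper names and the sample string are made
up, and I've written it so that it should run unchanged on Python 2 and
on Python 3.3 or later.  The point is to keep the same type on both sides
of replace, and to show explicitly the ASCII decode that Python 2 attempts
behind the scenes when a plain string is handed to unicode.replace.

```python
# Sketch only: the helper names and sample data below are hypothetical.

SOFT_HYPHEN = u'\xad'   # U+00AD, the character the PDF scrape returned


def replace_soft_hyphen_text(text):
    """Replace U+00AD in a unicode string; both arguments are unicode."""
    return text.replace(SOFT_HYPHEN, u'-')


def replace_soft_hyphen_bytes(data):
    """Replace byte 0xAD in a latin1-encoded byte string; both arguments are bytes."""
    return data.replace(b'\xad', b'-')


sample = u'5 \xad 3'

cleaned_text = replace_soft_hyphen_text(sample)
cleaned_bytes = replace_soft_hyphen_bytes(sample.encode('latin1'))

# What Python 2 tries implicitly when str and unicode are mixed:
# decoding byte 0xAD with the ASCII codec, which cannot succeed.
try:
    b'\xad'.decode('ascii')
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

print(repr(cleaned_text))
print(repr(cleaned_bytes))
print(implicit_decode_failed)
```

Once the soft hyphens are gone, cleaned_text.encode('ascii') should
succeed, provided nothing else in the scraped text falls outside ASCII.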

Cheers,
Cliff