Help with character encodings
J. Cliff Dyer
jcd at sdf.lonestar.org
Tue May 20 11:50:17 EDT 2008
On Tue, 2008-05-20 at 08:28 -0700, Gary Herron wrote:
> A_H wrote:
> > Help!
> >
> > I've scraped a PDF file for text and all the minus signs come back as
> > u'\xad'.
> >
> > Is there any easy way I can change them all to plain old ASCII '-' ???
> >
> > str.replace complained about a missing codec.
> >
> >
> >
> > Hints?
> >
>
> Encoding it into a 'latin1' encoded string seems to work:
>
> >>> print u'\xad'.encode('latin1')
> -
>
>
Here's what I've found:
>>> x = u'\xad'
>>> x.replace('\xad','-')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 0: ordinal not in range(128)
>>> x.replace(u'\xad','-')
u'-'
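If you replace the *string* '\xad' in the first argument to replace with
the *unicode object* u'\xad', Python won't complain anymore. In the first
call, Python has to convert the byte string '\xad' to unicode before it can
search x for it, and it does that with the default 'ascii' codec, which
can't decode byte 0xad. (Mind you, you weren't using str.replace. You were
using unicode.replace. Slight difference, but important.) If you do the
replace on a plain string instead, nothing has to be converted, so you
don't get a UnicodeDecodeError: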
>>> x = x.encode('latin1')
>>> x
'\xad'
>>> # Note the lack of a u before the ' above.
>>> x.replace('\xad','-')
'-'
>>>
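Putting that together, here is a minimal sketch of the unicode-only route. The function names and the sample string are made up for illustration; they are not from A_H's code, and I'm assuming the scraped text arrives as a unicode object. The second helper just makes the implicit ASCII decode explicit, so you can see why the byte '\xad' trips it.

```python
def replace_soft_hyphens(text):
    """Return text with each u'\\xad' replaced by a plain ASCII '-'."""
    # Both arguments are unicode literals, so Python never has to
    # decode a byte string behind the scenes.
    return text.replace(u'\xad', u'-')


def decodes_as_ascii(raw):
    """Report whether a byte string survives decoding with the 'ascii' codec."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError:
        # Raised for any byte outside range(128), such as 0xad.
        return False
    return True


# Hypothetical sample standing in for text scraped from the PDF.
scraped = u'10 \xad 3 = 7'
cleaned = replace_soft_hyphens(scraped)
```

Once every u'\xad' is gone, and provided nothing else non-ASCII is left in the text, cleaned.encode('ascii') should give you a plain byte string without complaint.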
Cheers,
Cliff