usage of <string>.encode('utf-8','xmlcharrefreplace')?

Tue Feb 19 01:38:24 EST 2008

On Feb 18, 10:52 pm, "Carsten Haese" <cars... at uniqsys.com> wrote:
> On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote
>
>
>
> > Well, as usual I am confused by unicode encoding errors.
>
> > I have a string with problematic characters in it which I'd like to
> > put into a postgresql table.
> > That results in a postgresql error so I am trying to fix things with
> > <string>.encode
>
> > >>> s = 'he Company\xef\xbf\xbds ticker'
> > >>> print s
> > he Company�s ticker
>
> > Trying for an encode:
>
> > >>> print s.encode('utf-8')
> > Traceback (most recent call last):
> >   File "<input>", line 1, in <module>
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
> > 10: ordinal not in range(128)
>
> > OK, that's pretty much as expected, I know this is not valid utf-8.
>
> Actually, the string *is* valid UTF-8, but you're confused about encoding and
> decoding. Encoding is the process of turning a Unicode object into a byte
> string. Decoding is the process of turning a byte string into a Unicode object.
>

...or to put it more simply:  encode() is used to convert a unicode
string into a regular (byte) string.  A unicode string looks like this:

s = u'\u0041'

but your string looks like this:

s = 'he Company\xef\xbf\xbds ticker'

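Note that there is no 'u' in front of your string, so it is a regular
(byte) string, not a unicode string.  encode() only makes sense on a
unicode string; when you call it on a regular string, python first
has to turn that string into unicode, and that hidden step is what
fails.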
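
As a rough sketch of the order of operations (not something from the
original post): decode the byte string with the codec it was actually
written in, and only then encode the resulting unicode string.  The
variable names are made up for illustration, and the b'' prefix
assumes Python 2.6 or later.

```python
# The byte string from the original post; it is valid UTF-8.
raw = b'he Company\xef\xbf\xbds ticker'

# Step 1: decode the bytes into a unicode string, naming the codec
# explicitly instead of relying on the default.
text = raw.decode('utf-8')

# Step 2: encode the unicode string back into bytes.
# With 'utf-8' every character is representable, so the round trip
# gives back the original bytes.
as_utf8 = text.encode('utf-8')

# With 'ascii' the non-ASCII character cannot be represented, so the
# 'xmlcharrefreplace' handler swaps it for a numeric character reference.
as_ascii = text.encode('ascii', 'xmlcharrefreplace')

print(repr(as_utf8))
print(repr(as_ascii))
```

Note that 'xmlcharrefreplace' only kicks in for characters the target
codec can't represent, so it has no visible effect with 'utf-8' here;
with 'ascii' the problem character comes out as &#65533;.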

> Also, why are the exceptions above complaining about the 'ascii'
> codec if I am asking for 'utf-8' conversion?

If a python function requires a unicode string and a unicode string
isn't provided, then python will implicitly try to convert the string
it was given into a unicode string.  In order to convert a given
string into a unicode string, python needs to know the secret code
that was used to produce the given string.  The secret code is
otherwise known as a 'codec'.  When python attempts an implicit
conversion of a given string into a unicode string, python uses the
default codec, which is normally set to 'ascii'.  Since your string
contains non-ascii characters, you get an error.  That all happens
long before your 'utf-8' argument ever comes into play.

decode() is used to convert a regular string into a unicode string
(the opposite of encode()).  Your error is a 'decode' error (rather
than an 'encode' error):

UnicodeDecodeError

because python is implicitly trying to convert the given regular
string into a unicode string with the default ascii codec, and python
is unable to do that.
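
To see that step in isolation, here is a small sketch that does by
hand what python does behind the scenes, assuming the default codec
is 'ascii'.  The names raw and failure are only for illustration.

```python
# The byte string from the original post.
raw = b'he Company\xef\xbf\xbds ticker'

failure = None
try:
    # Assumption: this mirrors the implicit conversion, which uses the
    # default codec ('ascii') rather than the 'utf-8' you asked for.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    # Record which codec failed, where, and on which byte.
    failure = (exc.encoding, exc.start, exc.object[exc.start:exc.start + 1])

print(failure)
```

The codec name and position recorded here line up with the traceback
in the original post: the 'ascii' codec gives up at position 10, on
byte 0xef.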