usage of <string>.encode('utf-8','xmlcharrefreplace')?

Tue Feb 19 00:52:33 EST 2008

On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote
> Well, as usual I am confused by unicode encoding errors.
> 
> I have a string with problematic characters in it which I'd like to
> put into a postgresql table.
> That results in a postgresql error so I am trying to fix things with
> <string>.encode
> 
> >>> s = 'he Company\xef\xbf\xbds ticker'
> >>> print s
> he Companyï¿½s ticker
> >>>
> 
> Trying for an encode:
> 
> >>> print s.encode('utf-8')
> Traceback (most recent call last):
>   File "<input>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
> 10: ordinal not in range(128)
> 
> OK, that's pretty much as expected, I know this is not valid utf-8.

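Actually, the string *is* valid UTF-8 (the bytes \xef\xbf\xbd are the UTF-8
form of U+FFFD, the replacement character), but you're confusing encoding with
decoding. Encoding is the process of turning a Unicode object into a byte
string. Decoding is the process of turning a byte string into a Unicode object.
Your s is already a byte string, so s.encode('utf-8') first has to decode it
implicitly with the default 'ascii' codec, and that implicit decode is what
raises the UnicodeDecodeError you're seeing.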
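
The two directions can be sketched as follows. This is a minimal illustration
written for Python 3, which is an assumption (the session quoted above is
Python 2, where the byte string is a plain str). The names raw, text and
failure are made up for the example. The explicit decode('ascii') call stands
in for the implicit decode that Python 2 performs when .encode() is called on
a byte string.

```python
# Python 3 sketch; `raw` mirrors the byte string from the quoted session.
raw = b'he Company\xef\xbf\xbds ticker'

# Decoding: bytes -> text. The bytes are valid UTF-8, so this succeeds.
text = raw.decode('utf-8')

# Encoding: text -> bytes. This round-trips to the original bytes.
assert text.encode('utf-8') == raw

# Decoding the same bytes as ASCII fails at the first non-ASCII byte,
# which is the failure reported in the quoted traceback.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    failure = (exc.encoding, exc.start)
```

Here failure records the codec name and the offset of the offending byte,
which correspond to the 'ascii' codec and position 10 in the traceback.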

You need to decode your byte string into a Unicode object, and then encode the
result to a byte string in a different encoding. For example:

>>> s = 'he Company\xef\xbf\xbds ticker'
>>> s.decode("utf-8").encode("ascii", "xmlcharrefreplace")
'he Company&#65533;s ticker'
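
For reference, the same decode-then-encode step might look like this on
Python 3. This is only a sketch: the helper name to_ascii_xml and its
source_encoding parameter are hypothetical, and it assumes the input really
is UTF-8, as it is for the string above.

```python
def to_ascii_xml(raw, source_encoding='utf-8'):
    """Decode bytes to text, then re-encode as ASCII, replacing each
    non-ASCII character with an XML numeric character reference."""
    text = raw.decode(source_encoding)
    return text.encode('ascii', 'xmlcharrefreplace')

result = to_ascii_xml(b'he Company\xef\xbf\xbds ticker')
```

Note that the result is a bytes object on Python 3, so it would still need a
.decode('ascii') if a text string is wanted.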

By the way, whether this is the correct fix for your PostgreSQL error is not
clear, since you kept that error message a secret for some reason. There could
be a better solution than transcoding the string in this way, but we won't
know until you show us the actual error you're trying to fix. At the moment,
it's like showing you the best way to inflate a tire with a hammer.

Hope this helps,

--
Carsten Haese
http://informixdb.sourceforge.net