usage of <string>.encode('utf-8','xmlcharrefreplace')?

Carsten Haese carsten at uniqsys.com
Tue Feb 19 08:12:06 EST 2008


On Mon, 18 Feb 2008 22:24:56 -0800 (PST), J Peyret wrote
> [...]
> You are right, I am confused about unicode.  Guilty as charged. 

You should read http://www.amk.ca/python/howto/unicode to clear up some of
your confusion.

> [...]
> Also doesn't help that I am not sure what encoding is used in the 
> data file that I'm using.

That is, incidentally, the direct cause of the error message below.

> [...] 
>  <class 'psycopg2.ProgrammingError'>
> invalid byte sequence for encoding "UTF8": 0x92
> HINT:  This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".

What this error message means is that you've given the database a byte string
in an unknown encoding, but by not telling the database otherwise, you're
implicitly claiming that the string is utf-8 encoded. The database has
encountered the byte 0x92 in a position where it cannot occur in valid utf-8
(in utf-8, 0x92 is only legal as a continuation byte inside a multi-byte
sequence), so it raises this error: your string is meaningless as utf-8
encoded text.
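
You can reproduce the same complaint in Python without involving the database.
The following is only a minimal sketch, written for Python 3 (where byte
strings are the bytes type); the sample value in raw is made up for
illustration and is not taken from your data:

```python
# A lone 0x92 byte is not valid UTF-8, so decoding fails at that byte.
raw = b"don\x92t"  # hypothetical sample data containing the byte 0x92

try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    # exc.start is the offset of the first byte that could not be decoded.
    bad_byte = raw[exc.start]
    print("cannot decode byte 0x%02x at position %d" % (bad_byte, exc.start))
```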

This is not surprising, since you don't know the encoding of the string. Well,
now we know it's not utf-8.

> column is a varchar(2000) and the "guilty characters" are those used
> in my posting.

I doubt that. The error message is complaining about a byte with the value
0x92. That byte appeared nowhere in the string you posted, so the error
message must have been caused by a different string.
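
To track down which string is really responsible, you could check each value
before handing it to the database. This is a rough sketch for Python 3; the
function name find_invalid_utf8 and the sample rows are hypothetical, and it
assumes your values are available as byte strings:

```python
def find_invalid_utf8(values):
    """Return (index, byte offset, offending byte) for each value
    that cannot be decoded as UTF-8."""
    problems = []
    for index, value in enumerate(values):
        try:
            value.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append((index, exc.start, value[exc.start]))
    return problems


# Hypothetical sample rows: the first two are valid UTF-8, the third is not.
rows = [b"plain ascii", b"caf\xc3\xa9", b"it\x92s broken"]
problems = find_invalid_utf8(rows)
for index, offset, byte in problems:
    print("row %d: byte 0x%02x at offset %d" % (index, byte, offset))
```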

Now for the solution of your problem: If you don't care what the encoding of
your byte string is and you simply want to treat it as binary data, you should
use client_encoding "latin-1" or "iso8859_1" (they're different names for the
same thing). Since latin-1 simply maps the bytes 0 to 255 to unicode code
points 0 to 255, you can store any byte string in the database, and get the
same byte string back from the database. (The same is not true for utf-8 since
not every random string of bytes is a valid utf-8 encoded string.)
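
The round-trip property is easy to check in Python itself. This sketch (Python
3 again, using only the standard codecs) demonstrates the mapping on the
Python side; it does not talk to a database:

```python
import codecs

# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding never fails...
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]

# ...and encoding gives back exactly the bytes we started with.
round_trip = text.encode("latin-1")

# The same bytes are not valid utf-8, so that decode raises an error.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# "latin-1" and "iso8859_1" resolve to the same Python codec.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("iso8859_1").name
```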

Hope this helps,

--
Carsten Haese
http://informixdb.sourceforge.net


