Python Unicode to String conversion

John Machin sjmachin at lexicon.net
Fri Aug 31 20:09:10 EDT 2007


On Sep 1, 8:55 am, thijs.br... at gmail.com wrote:
> Hi everyone,
>
> I'm having quite some troubles trying to convert Unicode to String
> (for use in psycopg, which apparently doesn't know how to cope with
> unicode strings).
>
> The error I keep having is something like this:
> ERREUR:  Séquence d'octets invalide pour le codage «UTF8» : 0xe02063
>
> (sorry, locale is french, it means "byte sequence invalid for encoding
> <<utf8>>",

I'm a pig-ignorant Anglo; it's news to me that Python error messages
varied by locale; I thought they always came out in ASCII as
G(od|uido) intended :-) Does that message emanate from Python or psycopg?
In either case, it is saying that it is expecting a UTF8-encoded
string, but the string given to it is not a valid UTF8-encoded string.

> the value is probably an e with one of the french accents)

PROBABLY?? (1) Please try to understand that computers are quite
deterministic. (2) If you want help, stop guessing and use something
like
    print repr(the_value)
and tell us what it *actually* is. Also show us the *relevant* parts
of your code, so that we can see how you are trying to convert your
data, and how you are trying to pass it to psycopg. Also show us the
full traceback that you get.
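
The repr() advice above can be wrapped in a tiny helper. This is only a sketch: describe() is a hypothetical name and the_value is a made-up stand-in for whatever your program really holds. It is written to run on both Python 2 and Python 3.

```python
def describe(value):
    """Return the type name and repr of value, so nothing is left to guesswork."""
    return type(value).__name__, repr(value)

# Hypothetical stand-in for the real data; substitute your own variable.
the_value = b"\xe0 c"

kind, shown = describe(the_value)
# The type name tells you whether you hold text or encoded bytes;
# the repr shows the exact byte or code point values.
print("%s: %s" % (kind, shown))
```

Posting that one line of output, together with the code and the traceback, removes the need for "probably".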

>
> I've found lots of stuff about this googling the error, but I don't
> seem to be able to find a "works always"-function just to convert a
> unicode variable back to string...
>
> If someone could find me a solution, that'd really be a lifesaver.
> I've been losing hours and hours over this one :s
>

1. Find out what your input data actually is (e.g. unicode)
2. Find out what form psycopg requires (e.g. utf8-encoded str).
3. unicode to utf8 is quite simple:

>>> useq = u"S\xe9quence"
# that's "Sequence" with an acute accent on the first "e"
>>> useq8 = useq.encode('utf8')
>>> print repr(useq)
u'S\xe9quence'
>>> print repr(useq8)
'S\xc3\xa9quence'
>>> useq8.decode('utf8')
u'S\xe9quence'
# round trip works as expected
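
Steps 1 to 3 could be folded into one function along these lines. This is a minimal sketch, not a psycopg recipe: to_utf8 and its source_encoding parameter are hypothetical names, and it assumes (per step 2) that the receiving end wants a UTF-8 encoded byte string. It runs on both Python 2 and Python 3.

```python
text_type = type(u"")    # unicode on Python 2, str on Python 3
binary_type = type(b"")  # str on Python 2, bytes on Python 3

def to_utf8(value, source_encoding="utf8"):
    """Hypothetical helper: return value as a UTF-8 encoded byte string."""
    if isinstance(value, text_type):
        # Already text: encoding to UTF-8 cannot fail for ordinary characters.
        return value.encode("utf8")
    if isinstance(value, binary_type):
        # Bytes: decode with the encoding the caller says they are in, so
        # that bad input raises here rather than inside the database driver.
        return value.decode(source_encoding).encode("utf8")
    raise TypeError("expected text or bytes, got %s" % type(value).__name__)

useq8 = to_utf8(u"S\xe9quence")
print(repr(useq8))
```

No function can "always work" on byte strings without being told their encoding; source_encoding has to come from knowledge of where the data originated, not from a guess.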

Here is what ASCII-Python says about malformed UTF8:

>>> "\xe0\x20\x63".decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python25\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid data
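
For what it's worth, the three bytes in the error message are valid Latin-1: 0xe0 is "a" with a grave accent, followed by a space and "c". The sketch below assumes, purely for illustration, that the data really is Latin-1; that assumption must be confirmed with repr() on the real input before relying on it. It runs on both Python 2 and Python 3.

```python
raw = b"\xe0\x20\x63"  # the byte sequence quoted in the error message

try:
    text = raw.decode("utf8")
except UnicodeDecodeError:
    # ASSUMPTION: the bytes are Latin-1. Latin-1 maps every byte to a
    # character, so this decode never fails, even when the guess is wrong.
    text = raw.decode("latin-1")

fixed = text.encode("utf8")  # now a valid UTF-8 byte string
print(repr(text))
print(repr(fixed))
```

Because the Latin-1 decode succeeds on any input, a wrong guess shows up as wrong characters rather than as an exception.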

Cheers,
John



