Comparing UTF-8 into UCS-2 and vice versa (newbie :-) )

Sun Jun 17 03:48:30 EDT 2007

> I recently rewrote a .NET application in Python.
> The application basically receives streams via a TCP socket and
> performs operations against an existing database.
> The database is SQLite3 (encoded as UTF-8).
> The network streams are encoded as UCS-2.
> 
> Since in UCS-2, 'A' = '0041', and when I check with the built-in
> functions I get unicode("A", "utf-8") = u'A' = u'\u0041', I
> wonder what the difference is, and how I can safely encode/decode
> UCS-2 streams and match them with the UTF-8 representation.

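In unicode("A", "utf-8"), the "utf-8" parameter does *not* mean
that the output is in UTF-8; it says that the *input* bytes are
UTF-8. The result, u'A' = u'\u0041', is a Unicode object: \u0041
names the code point U+0041, not a sequence of bytes.
So in UTF-8, "A" = '41' != '0041'. In UCS-2, the A consumes two
bytes; in UTF-8, it consumes only one byte.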

For other characters, the two encodings differ more. For example,
for u'\xf6', the UCS-2 representation (big-endian) is '00F6', while
the UTF-8 representation is 'C3B6'. For u'\u20AC', the UCS-2 is
'20AC' and the UTF-8 is 'E282AC' (i.e. three bytes).
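
The byte sequences quoted above can be reproduced with a short
sketch. It assumes Python 3 and uses "utf-16-be" as a stand-in for
big-endian UCS-2, which is valid for these characters because they
all lie in the Basic Multilingual Plane. The helper name encoded_hex
is made up for this illustration.

```python
def encoded_hex(ch):
    """Return (UCS-2 big-endian hex, UTF-8 hex) for one BMP character."""
    ucs2 = ch.encode("utf-16-be").hex().upper()  # assumed UCS-2 stand-in
    utf8 = ch.encode("utf-8").hex().upper()
    return ucs2, utf8

for ch in ("A", "\xf6", "\u20ac"):
    ucs2, utf8 = encoded_hex(ch)
    print("U+%04X  UCS-2: %s  UTF-8: %s" % (ord(ch), ucs2, utf8))
```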

You should always use Unicode objects in your program: decode from
UCS-2 or UTF-8 when data arrives from the network/database, and
encode to UCS-2 or UTF-8 only when sending data back out.
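
One way this could look, as a rough sketch only: it assumes Python 3,
assumes the stream is big-endian (your post does not say which byte
order it uses), and uses "utf-16-be" as a stand-in for UCS-2. The
names NETWORK_ENCODING, from_network and to_utf8 are hypothetical,
not part of any library.

```python
NETWORK_ENCODING = "utf-16-be"  # assumption: big-endian UCS-2 on the wire

def from_network(raw):
    """Decode bytes received from the socket into a Unicode string."""
    return raw.decode(NETWORK_ENCODING)

def to_utf8(text):
    """Encode a Unicode string as UTF-8 bytes for storage."""
    return text.encode("utf-8")

# Sample data: the same three characters in both encodings.
wire = b"\x00A\x00\xf6\x20\xac"     # UCS-2 big-endian: A, o-umlaut, euro sign
stored = b"A\xc3\xb6\xe2\x82\xac"   # the same text as UTF-8

# Compare at the Unicode level, never byte against byte.
text = from_network(wire)
print(text == stored.decode("utf-8"))
print(to_utf8(text) == stored)
```

Comparing the raw bytes directly (wire == stored) would fail even
though both hold the same text, which is why the comparison is done
after decoding.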

HTH,
Martin