[Tutor] Struct and UTF-16

Sun Oct 2 15:24:04 CEST 2005

Liam Clarke wrote:
> What's the difference between
> 
> x = "Hi"
> y = x.encode("UTF-16")
> 
> and
> 
> y = unicode(x, "UTF-16")

They are more-or-less opposite.

encode() converts away from unicode. (Think of unicode as the 'normal' format and anything else as 'encoded'.) Normally it is used on a unicode string, not a byte string. It means, "interpret this string as unicode, then convert it to an encoded byte string using the given encoding".

When you encode a non-unicode string (like "Hi"), the string is first converted to unicode (decoded) using sys.getdefaultencoding(), then encoded using the supplied encoding. So
'Hi'.encode('utf-16')
is the same as
'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')

In either case, the result is a string in UTF-16 encoding:
 >>> 'Hi'.encode('UTF-16')
'\xff\xfeH\x00i\x00'
 >>> 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
'\xff\xfeH\x00i\x00'

Note that the utf-16 codec puts a byte-order mark ('\xff\xfe' here, on a little-endian machine) at the start of the output; then 'H' becomes 'H\x00' and 'i' becomes 'i\x00'.
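
If you want to see where the byte-order mark comes from, here is a small sketch comparing the plain utf-16 codec with the two explicit byte orders. The variable names are just for illustration, and I am using u'' and b'' literals so that it runs unchanged on Python 2.6 or later and on Python 3; on an older interpreter drop the b prefixes.

```python
import codecs

text = u"Hi"

# With an explicit byte order, the codec writes no byte-order mark.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# The plain "utf-16" codec prepends a BOM in the machine's native byte order,
# so which of the two forms you get depends on the hardware.
with_bom = text.encode("utf-16")

print(repr(little))
print(repr(big))
print(with_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))
```

So if you need a fixed layout with no BOM (for example when packing the bytes into a binary record), asking for 'utf-16-le' or 'utf-16-be' explicitly is probably what you want.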

Because sys.getdefaultencoding() is used to convert to unicode, you will get an error if the original string cannot be decoded with this encoding:

 >>> '\xe3'.encode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
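
The hidden decode step can be written out by hand, which makes the failure easier to understand. This is only a sketch: I am assuming the default encoding is ascii (as in the traceback above) and, purely for the sake of the example, that the byte really came from Latin-1 data. The names raw, text and encoded are made up.

```python
raw = b"\xe3"  # one byte outside the ASCII range

# This is the step str.encode() performs implicitly with the default encoding.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    failed = True
    reason = exc.reason
else:
    failed = False
    reason = None

# Naming the real source encoding (assumed here to be Latin-1) avoids the error.
text = raw.decode("latin-1")
encoded = text.encode("utf-16-le")

print(failed)
print(reason)
print(repr(encoded))
```

The point is that the codec cannot guess what the bytes mean; you have to tell it.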

What about unicode('Hi', 'utf-16')? This doesn't do anything useful:
 >>> unicode('Hi', 'UTF-16')
u'\u6948'

unicode('Hi', 'utf-16') means the same as 'Hi'.decode('utf-16'). In this case we are saying, "Interpret this string as an encoded byte string in the given encoding, and convert it to a unicode string." Since 'Hi' is not, in fact, a byte string encoded in UTF-16, the results are not very useful: the two bytes 'H' (0x48) and 'i' (0x69) are read as a single little-endian 16-bit value, 0x6948, which is the character u'\u6948'.
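
Since the subject line mentions struct, here is a rough way to check where that odd character comes from. It unpacks the same two bytes as one little-endian unsigned 16-bit integer and compares it with what the decoder produces. I use 'utf-16-le' rather than 'utf-16' so the result does not depend on the machine's byte order; the variable names are hypothetical.

```python
import struct

raw = b"Hi"  # the bytes 0x48 0x69

# Read the two bytes as a single little-endian unsigned 16-bit integer.
(code_unit,) = struct.unpack("<H", raw)

# Decoding the same bytes as little-endian UTF-16 yields that code point.
decoded = raw.decode("utf-16-le")

print(hex(code_unit))
print(repr(decoded))
```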

To summarize:
If you have an encoded byte string and you want a unicode string, use str.decode() or unicode().

If you have a unicode string and you want an encoded byte string, use unicode.encode().

If you are using str.encode() you probably haven't thought through your problem completely, and you will likely get UnicodeDecodeErrors when you have non-ASCII data.

If you are writing a unicode-aware application, a good strategy is to keep all strings internally as unicode and to convert to and from the required encodings at the boundaries. 
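
As a minimal sketch of that strategy (not a recipe for any particular application): decode once on the way in, work with unicode in the middle, and encode once on the way out. The helper names to_internal and to_external, the sample data and the choice of Latin-1 input and UTF-8 output are all assumptions for the example.

```python
def to_internal(raw, encoding):
    """Input boundary: turn an encoded byte string into a unicode string."""
    return raw.decode(encoding)


def to_external(text, encoding):
    """Output boundary: turn a unicode string into an encoded byte string."""
    return text.encode(encoding)


incoming = b"caf\xe9"                    # Latin-1 bytes, e.g. read from a file
text = to_internal(incoming, "latin-1")  # unicode from here on
shouted = text.upper()                   # all processing happens on unicode
outgoing = to_external(shouted, "utf-8") # encode only when leaving the program

print(repr(text))
print(repr(outgoing))
```

Everything between the two helper calls never has to care which encodings the outside world uses.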

Kent