[Tutor] Struct and UTF-16

Mon Oct 3 10:03:10 CEST 2005

Thanks Kent,

This is my first time dealing with unicode vs 'normal' strings in
Python, and I do look forward to Python 3.0... At the moment I'm just
trying to understand how to use UTF-16.

Basically, I have data coming straight from struct.unpack(), and it's
a UTF-16 string. I'm just trying to get my head around dealing with
the data coming in from struct, and putting my data out through
struct.

It doesn't help much that struct treats every string as one byte per
char, whereas UTF-16 uses two bytes per code unit. And I was having
trouble working out how to write UTF-16 data out properly.

But, if I understand it correctly, I could use

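import struct

j = u"Hi"  # some unicode string
out = j.encode("UTF-16")  # encoded byte string; this codec prepends a byte-order mark
pattern = "%ds" % len(out)  # e.g. "6s": struct counts bytes, not characters
packed = struct.pack(pattern, out)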

without too much difficulty.
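
For reference, a minimal round-trip sketch of the same idea, written in Python 3 syntax (where unicode strings are `str` and encoded data is `bytes`). The helper names `pack_utf16` and `unpack_utf16` and the 2-byte length prefix are assumptions made up for illustration, not part of any existing file format. It uses "utf-16-le" rather than "UTF-16" so that no byte-order mark is written and the byte order is fixed.

```python
import struct

def pack_utf16(text):
    """Pack text as a 2-byte little-endian length followed by UTF-16-LE bytes."""
    data = text.encode("utf-16-le")  # the explicit -le codec writes no BOM
    # "<H" is an unsigned 16-bit byte count; "%ds" is that many raw bytes.
    return struct.pack("<H%ds" % len(data), len(data), data)

def unpack_utf16(buf):
    """Reverse pack_utf16: read the byte count, then decode that many bytes."""
    (size,) = struct.unpack_from("<H", buf, 0)
    (data,) = struct.unpack_from("<%ds" % size, buf, 2)
    return data.decode("utf-16-le")

packed = pack_utf16("Hi")
assert packed == b"\x04\x00H\x00i\x00"
assert unpack_utf16(packed) == "Hi"
```

The length stored in the prefix is a byte count, not a character count, which is what the "%ds" pattern needs when unpacking.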

Regards,

Liam Clarke

On 10/3/05, Kent Johnson <kent37 at tds.net> wrote:
> Liam Clarke wrote:
> > What's the difference between
> >
> > x = "Hi"
> > y = x.encode("UTF-16")
> >
> > and
> >
> > y = unicode(x, "UTF-16")
>
> They are more-or-less opposite.
>
> encode() converts away from unicode. (Think of unicode as the 'normal' format, and anything else as 'encoded'.) Normally it is used on a unicode string, not a byte string. It means, "interpret this string as unicode, then convert it to an encoded byte string using the given encoding".
>
> When you encode a non-unicode string (like "Hi"), the string is first converted to unicode (decoded) using sys.getdefaultencoding(), then encoded using the supplied encoding. So
> 'Hi'.encode('utf-16')
> is the same as
> 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
>
> In either case, the result is a string in UTF-16 encoding:
>  >>> import sys
>  >>> 'Hi'.encode('UTF-16')
> '\xff\xfeH\x00i\x00'
>  >>> 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> '\xff\xfeH\x00i\x00'
>
> Note that the utf-16 codec puts a byte-order mark ('\xff\xfe', which here indicates little-endian order) at the start of the output; then 'H' becomes 'H\x00' and 'i' becomes 'i\x00'.
>
> Because sys.getdefaultencoding() is used to convert to unicode, you will get an error if the original string cannot be decoded with this encoding:
>
>  >>> '\xe3'.encode('utf-16')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
>
>
> What about unicode('Hi', 'utf-16')? This doesn't do anything useful:
>  >>> unicode('Hi', 'UTF-16')
> u'\u6948'
>
> unicode('Hi', 'utf-16') means the same as 'Hi'.decode('utf-16'). In this case we are saying, "Interpret this string as an encoded byte string in the given encoding, and convert it to a unicode string." Since 'Hi' is not, in fact, a byte string encoded in UTF-16, the results are not very useful.
>
>
> To summarize:
> If you have an encoded byte string and you want a unicode string, use str.decode() or unicode().
>
> If you have a unicode string and you want an encoded byte string, use unicode.encode().
>
> If you are using str.encode() you probably haven't thought through your problem completely, and you will likely get UnicodeDecodeErrors when you have non-ASCII data.
>
>
> If you are writing a unicode-aware application, a good strategy is to keep all strings internally as unicode and to convert to and from the required encodings at the boundaries.
>
> Kent
>
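
Kent's summary can be sketched in Python 3 terms, where the unicode type is `str` and an encoded byte string is `bytes`. This is only an illustration under that assumption: `read_field` and `write_field` are hypothetical names for the boundary functions, and the byte order is pinned to little-endian so the results do not depend on the machine.

```python
def read_field(raw):
    """Boundary in: encoded bytes -> unicode text."""
    return raw.decode("utf-16-le")

def write_field(text):
    """Boundary out: unicode text -> encoded bytes."""
    return text.encode("utf-16-le")

# Inside the program everything stays unicode.
name = read_field(b"H\x00i\x00")
assert name == "Hi"
assert write_field(name.upper()) == b"H\x00I\x00"

# Decoding bytes that are not really UTF-16 still succeeds but gives nonsense,
# matching the u'\u6948' result in the quoted message.
assert b"Hi".decode("utf-16-le") == "\u6948"

# The plain "utf-16" codec prepends a byte-order mark when encoding.
assert "Hi".encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")
```

In Python 3 the str.encode() pitfall from the quoted message cannot occur in the same way, because `bytes` has no encode() method and `str` has no decode() method.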