[Tutor] Struct and UTF-16

Mon Oct 3 12:07:38 CEST 2005

Liam Clarke wrote:
> Basically, I have data which is coming straight from struct.unpack()
> and it's an UTF-16 string, and I'm just trying to get my head around
> dealing with the data coming in from struct, and putting my data out
> through struct.
> 
> It doesn't help overly that struct considers all strings to consist of
> one byte per char, whereas UTF-16 is two. And I was having trouble as
> to how to write UTF-16 stuff out properly.
> 
> But, if I understand it correctly, I could use
> 
> j = #some unicode string
> out = j.encode("UTF-16")
> pattern = "%ds" % len(out)
> struct.pack(pattern, out)

Yes, that looks good. Taking len() of the encoded string gives you the byte count, which is what the 's' format needs. Note that plain 'utf-16' puts a byte-order mark (BOM) in the first two bytes. If you don't want that, use utf-16le or utf-16be. The correct choice depends on what the consumer of the data expects or can handle.

 >>> u'Hi'.encode('utf-16le')
'H\x00i\x00'
 >>> u'Hi'.encode('utf-16be')
'\x00H\x00i'
 >>> u'Hi'.encode('utf-16')
'\xff\xfeH\x00i\x00'
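
For the round trip through struct, here is a minimal sketch. It is written in Python 3 syntax, where encode() returns bytes, and it assumes a record layout that is not from your format: a little-endian 16-bit byte count followed by the UTF-16LE data. The helper names pack_utf16 and unpack_utf16 are made up for illustration, so adjust the layout to whatever your file or protocol actually uses.

```python
import struct

def pack_utf16(text, codec="utf-16le"):
    """Encode text and pack it as <byte count><encoded bytes>."""
    data = text.encode(codec)
    # Assumed layout: "<H" is a little-endian unsigned 16-bit byte count,
    # followed by exactly len(data) raw bytes.
    return struct.pack("<H%ds" % len(data), len(data), data)

def unpack_utf16(buf, codec="utf-16le"):
    """Read the byte count, then that many bytes, and decode them."""
    (size,) = struct.unpack_from("<H", buf, 0)
    (data,) = struct.unpack_from("%ds" % size, buf, 2)
    return data.decode(codec)

packed = pack_utf16("Hi")
print(repr(packed))            # b'\x04\x00H\x00i\x00'
print(unpack_utf16(packed))    # Hi
```

The count is in bytes, not characters, so "Hi" is stored with a length of 4. Naming the byte order explicitly on both sides also means no BOM ends up in the data.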

Kent