[Tutor] how to struct.pack a unicode string?

Steven D'Aprano steve at pearwood.info
Sat Dec 1 08:30:55 CET 2012


On 01/12/12 12:28, eryksun wrote:

> UTF-8 was
> designed to encode all of Unicode in a way that can seamlessly pass
> through libraries that process C strings (i.e. an array of non-null
> bytes terminated by a null byte). Byte values less than 128 are ASCII;
> beyond ASCII, UTF-8 uses 2-4 bytes, and all byte values are greater
> than 127, with standardized byte order. In contrast, UTF-16 and UTF-32
> have null bytes in the string and platform-determined byte order. The
> length and order of the optional byte order mark (BOM) distinguishes
> UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.

That's not quite right. The UTF-16BE and UTF-16LE encodings do not
take BOMs, because the encoding name already specifies the byte order:

py> s = u'abçЙ'
py> s.encode('utf-16LE')
'a\x00b\x00\xe7\x00\x19\x04'
py> s.encode('utf-16BE')
'\x00a\x00b\x00\xe7\x04\x19'


In contrast, plain ol' UTF-16 with no BE or LE suffix is ambiguous without
a BOM, so it uses one:

py> s.encode('utf-16')
'\xff\xfea\x00b\x00\xe7\x00\x19\x04'
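
The decoding side works the same way, and it's worth a quick sketch. This
one is written for Python 3 (where encode() returns bytes), unlike the
Python 2 session above, and the variable names are just mine for
illustration. I build the data with an explicit little-endian BOM so the
result shouldn't depend on which machine you run it on:

```python
import codecs

s = 'ab\xe7\u0419'  # same text as above: a, b, c-cedilla, Cyrillic short I

# Little-endian data with an explicit BOM in front.
data = codecs.BOM_UTF16_LE + s.encode('utf-16-le')

# The generic codec reads the BOM, picks the byte order, and drops it.
generic = data.decode('utf-16')

# The explicit codec does not treat the BOM specially, so it decodes it
# as an ordinary U+FEFF character at the front of the string.
explicit = data.decode('utf-16-le')

print(repr(generic))
print(repr(explicit))
```

So if you decode BOM-prefixed data with one of the explicit LE/BE codecs,
expect a stray u'\ufeff' at the start of the result.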


The same applies to UTF-32.
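
A rough way to check that for yourself, again sketched in Python 3 rather
than the Python 2 used above. The names (le, be, plain, boms) are only for
illustration:

```python
import codecs

s = 'ab\xe7\u0419'  # four code points, so 16 bytes in UTF-32 without a BOM

le = s.encode('utf-32-le')   # explicit byte order
be = s.encode('utf-32-be')   # explicit byte order
plain = s.encode('utf-32')   # no byte order given

boms = (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# For each result: does it start with a UTF-32 BOM, and how long is it?
print(le.startswith(boms), len(le))        # False 16
print(be.startswith(boms), len(be))        # False 16
print(plain.startswith(boms), len(plain))  # True 20
```

The extra four bytes in the last case are the BOM. Which of the two BOMs
you get from plain 'utf-32' depends on the byte order of your platform.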


> There's also a UTF-8 BOM used on Windows. Python calls this encoding
>  "utf-8-sig".

UTF-8-sig is an abomination, but sadly not just a Microsoft abomination.
Google Docs also uses it.

Although the Unicode standard does allow a BOM in UTF-8 (where it is not
actually a Byte Order Mark, since UTF-8 has no byte order to mark, but
more of a "UTF-8 signature"), using one is annoying and silly.
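
Here is a small sketch of why it is annoying, in Python 3 syntax (the
variable names are mine). The signature is just three extra bytes at the
front, and a plain UTF-8 decoder will happily hand it back to you as a
character:

```python
import codecs

s = 'ab\xe7\u0419'

signed = s.encode('utf-8-sig')   # UTF-8 with the signature prepended
plain = s.encode('utf-8')        # ordinary UTF-8

# The signature is exactly the three bytes EF BB BF.
print(signed == codecs.BOM_UTF8 + plain)

# Decoding signed data as plain UTF-8 leaves U+FEFF at the front.
leaked = signed.decode('utf-8')
print(repr(leaked))

# The utf-8-sig decoder strips the signature if present, and also
# accepts data that has no signature at all.
print(signed.decode('utf-8-sig') == s)
print(plain.decode('utf-8-sig') == s)
```

If you have to read files that might or might not carry the signature,
decoding with 'utf-8-sig' handles both cases.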



-- 
Steven

