[Tutor] how to struct.pack a unicode string?
Steven D'Aprano
steve at pearwood.info
Sat Dec 1 08:30:55 CET 2012
On 01/12/12 12:28, eryksun wrote:
> UTF-8 was
> designed to encode all of Unicode in a way that can seamlessly pass
> through libraries that process C strings (i.e. an array of non-null
> bytes terminated by a null byte). Byte values less than 128 are ASCII;
> beyond ASCII, UTF-8 uses 2-4 bytes, and all byte values are greater
> than 127, with standardized byte order. In contrast, UTF-16 and UTF-32
> have null bytes in the string and platform-determined byte order. The
> length and order of the optional byte order mark (BOM) distinguishes
> UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
That's not quite right. The UTF-16BE and UTF-16LE encodings do not
take BOMs, because the encoding name already specifies the byte order:
py> s = u'abçЙ'
py> s.encode('utf-16LE')
'a\x00b\x00\xe7\x00\x19\x04'
py> s.encode('utf-16BE')
'\x00a\x00b\x00\xe7\x04\x19'
In contrast, plain ol' UTF-16 with no BE or LE suffix is ambiguous without
a BOM, so it uses one:
py> s.encode('utf-16')
'\xff\xfea\x00b\x00\xe7\x00\x19\x04'
The same applies to UTF-32.
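For anyone who wants to check the UTF-32 case, here is a minimal sketch. It is written in Python 3 syntax (bytes literals, no u'' prefix) rather than the Python 2 session above, and the variable names are just for illustration. It does not assume a particular platform byte order, since plain 'utf-32' uses whatever is native:

```python
import codecs

# The same four characters as in the session above: a, b, ç, Й
s = 'ab\u00e7\u0419'

# Explicit byte order in the encoding name: no BOM is written.
le = s.encode('utf-32-le')
be = s.encode('utf-32-be')

# No byte order in the name: the codec writes a 4-byte BOM first,
# followed by the text in the platform's native byte order.
plain = s.encode('utf-32')

bom = plain[:4]
body = plain[4:]

print(repr(le))
print(repr(be))
print(repr(plain))
print(bom in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE))
```

The part after the BOM should be identical to either the LE or the BE result, depending on the machine you run it on.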
> There's also a UTF-8 BOM used on Windows. Python calls this encoding
> "utf-8-sig".
UTF-8-sig, an abomination, but sadly not just a Microsoft abomination.
Google Docs also uses it.
Although the Unicode standard does allow a BOM in UTF-8 (where it is
not actually a Byte Order Mark, since UTF-8 has no byte order to mark;
it is more of a "UTF-8 signature"), using one is annoying and silly.
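To see why it is annoying, here is a rough sketch, again in Python 3 syntax with made-up variable names, of what happens when signed data meets a plain UTF-8 decoder:

```python
import codecs

s = 'ab\u00e7\u0419'

plain = s.encode('utf-8')
signed = s.encode('utf-8-sig')

# The signature is U+FEFF encoded as UTF-8, i.e. the bytes EF BB BF,
# stuck on the front of otherwise ordinary UTF-8.
has_signature = signed.startswith(codecs.BOM_UTF8)

# A plain utf-8 decoder does not strip it, so U+FEFF leaks into the text.
leaked = signed.decode('utf-8')

# The utf-8-sig decoder strips the signature if present, and also
# accepts input that has no signature at all.
stripped = signed.decode('utf-8-sig')
unsigned_ok = plain.decode('utf-8-sig')

print(repr(leaked))
print(repr(stripped))
```

So if you have to read files that may or may not carry the signature, decoding with 'utf-8-sig' handles both cases.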
--
Steven