[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

eryk sun eryksun at gmail.com
Tue Aug 8 00:30:31 EDT 2017


On Tue, Aug 8, 2017 at 3:20 AM, Cameron Simpson <cs at cskk.id.au> wrote:
>
> As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This
> is because each encoding has a leading byte order mark to indicate big
> endianness or little endianness. For big endian UTF-16 data that is \xfe\xff;
> for little endian data it would be \xff\xfe.

To avoid encoding a byte order mark (BOM), use an "le" or "be" suffix, e.g.

    >>> 'Hello!'.encode('utf-16le')
    b'H\x00e\x00l\x00l\x00o\x00!\x00'
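
To make the arithmetic concrete, here is a small sketch that counts the
encoded bytes with and without a BOM. The names text, sizes, bom16 and
bom32 are only for illustration. The generic 'utf-16' and 'utf-32' codecs
write the BOM in the machine's native byte order, so the checks accept
either order.

```python
import codecs

text = 'Hello!'          # 6 code points, all in the ASCII range

# Encoded size in bytes for each codec.
sizes = {name: len(text.encode(name))
         for name in ('utf-8', 'utf-16', 'utf-16le', 'utf-32', 'utf-32le')}

# The BOM is U+FEFF: 2 bytes in UTF-16, 4 bytes in UTF-32.
bom16 = text.encode('utf-16')[:2]
bom32 = text.encode('utf-32')[:4]

print(sizes)
print(bom16 in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(bom32 in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE))
```

The sizes come out as 6 for UTF-8, 14 and 12 for UTF-16 with and without
the BOM, and 28 and 24 for UTF-32, i.e. (6 + 1) * 2, 6 * 2, (6 + 1) * 4
and 6 * 4.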

Sometimes a data format already specifies the byte order, which makes a
BOM redundant. For example, strings in the Windows registry use
UTF-16LE, without a BOM.
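
Going the other way, a rough sketch of decoding such data. The raw bytes
below are a made-up value standing in for BOM-less UTF-16LE data, not
something read from a real registry key.

```python
import codecs

# Hypothetical raw value, standing in for BOM-less UTF-16LE data.
raw = b'H\x00e\x00l\x00l\x00o\x00!\x00'

# The format fixes the byte order, so name it explicitly.
value = raw.decode('utf-16le')

# If a BOM is present, 'utf-16' consumes it, but 'utf-16le' keeps it
# as a leading U+FEFF character.
with_bom = codecs.BOM_UTF16_LE + raw
stripped = with_bom.decode('utf-16')
kept = with_bom.decode('utf-16le')

print(repr(value), repr(stripped), repr(kept))
```

So the suffixed codecs are the right choice only when the data really
has no BOM; otherwise the mark shows up as an extra character in the
decoded string.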

