[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

boB Stepp robertvstepp at gmail.com
Tue Aug 8 23:36:22 EDT 2017


On Mon, Aug 7, 2017 at 11:30 PM, eryk sun <eryksun at gmail.com> wrote:
> On Tue, Aug 8, 2017 at 3:20 AM, Cameron Simpson <cs at cskk.id.au> wrote:
>>
>> As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 bytes
>> respectively. This is because each encoding has a leading byte order
>> mark (BOM) to indicate whether the data is big endian or little
>> endian. For big endian data that is \xfe\xff; for little endian data
>> it would be \xff\xfe.
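
That matches what I see in the interpreter (a quick check in Python 3,
using the same 'Hello!' string as in the example below); the BOM adds
one extra code unit's worth of bytes:

    >>> len('Hello!'.encode('utf-16'))
    14
    >>> len('Hello!'.encode('utf-32'))
    28
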
>
> To avoid encoding a byte order mark (BOM), use an "le" or "be" suffix, e.g.
>
>     >>> 'Hello!'.encode('utf-16le')
>     b'H\x00e\x00l\x00l\x00o\x00!\x00'

If I do this, then I guess it becomes my responsibility to use the
correct "le" or "be" suffix when I later decode these bytes back into
Unicode code points.
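
For instance, here is a rough sketch of what I mean (Python 3, reusing
the 'Hello!' string from above):

    >>> data = 'Hello!'.encode('utf-16le')
    >>> data.decode('utf-16le')
    'Hello!'
    >>> # Decoding with the wrong suffix does not raise an error; it just
    >>> # yields byte-swapped code points (U+4800, U+6500, ... here).
    >>> data.decode('utf-16be') == 'Hello!'
    False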

> Sometimes a data format includes the byte order, which makes using a
> BOM redundant. For example, strings in the Windows registry use
> UTF-16LE, without a BOM.

Are there Windows booby-traps that I need to watch out for because of
this?  I already know that the code pages cmd.exe uses have caused me
some grief in displaying (or not displaying!) code points that I have
wanted to use.
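
When that bites me, the first thing I have learned to check is what
encoding Python thinks the console is using; as far as I can tell,
sys.stdout.encoding reports it:

    >>> import sys
    >>> sys.stdout.encoding   # e.g. 'cp437' under cmd.exe here; varies by locale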



-- 
boB

