[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

boB Stepp robertvstepp at gmail.com
Tue Aug 8 23:17:49 EDT 2017


On Mon, Aug 7, 2017 at 10:01 PM, Ben Finney <ben+python at benfinney.id.au> wrote:
> boB Stepp <robertvstepp at gmail.com> writes:
>
>> How is len() getting these values?
>
> By asking the objects themselves to report their length. You are
> creating different objects with different content::
>
>     >>> s = 'Hello!'
>     >>> s_utf8 = s.encode("UTF-8")
>     >>> s == s_utf8
>     False
>     >>> s_utf16 = s.encode("UTF-16")
>     >>> s == s_utf16
>     False
>     >>> s_utf32 = s.encode("UTF-32")
>     >>> s == s_utf32
>     False
>
> So it shouldn't be surprising that, with different content, they will
> have different length::
>
>     >>> type(s), len(s)
>     (<class 'str'>, 6)
>     >>> type(s_utf8), len(s_utf8)
>     (<class 'bytes'>, 6)
>     >>> type(s_utf16), len(s_utf16)
>     (<class 'bytes'>, 14)
>     >>> type(s_utf32), len(s_utf32)
>     (<class 'bytes'>, 28)
>
> What is it you think ‘str.encode’ does?

It is translating the Unicode code points into bits patterned by the
encoding specified.  I know this.  I was reading some examples from a
book and it was demonstrating the different lengths resulting from
encoding into UTF-8, 16 and 32.  I was mildly surprised that len()
even worked on these encoding results.  But for the life of me I can't
figure out for UTF-16 and 32 how these lengths are determined.  For
instance just looking at a single character:

py3: h = 'h'
py3: h16 = h.encode("UTF-16")
py3: h16
b'\xff\xfeh\x00'
py3: len(h16)
4

From Cameron's response, I know that \xff\xfe is a Big-Endian BOM.
But in my mind 0xff takes up 4 bytes as each hex digit occupies
16-bits of space.  Likewise 0x00 looks to be 4 bytes -- Is this
representing EOL?  So far I have 8 bytes for the BOM and 4 bytes for
what I am guessing is the end-of-the-line for a byte length of 12 and
I haven't even gotten to the "h" yet!  So my question is actually as
stated:  For these encoded bytes, how are these lengths calculated?
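
For reference, a minimal sketch of one way to account for these byte
counts.  It assumes every character fits in a single code unit (true
for ASCII text such as 'Hello!', not for all of Unicode), and that the
generic "UTF-16" and "UTF-32" codecs prepend a byte order mark while
"UTF-8" does not.  The helper name expected_len is made up for this
illustration; it is not part of Python.

```python
import codecs

s = "Hello!"

# In a bytes repr, each \xNN escape is ONE byte (two hex digits = 8 bits),
# and a printable ASCII character such as 'h' is also shown as one byte.
h16 = "h".encode("UTF-16")
byte_values = list(h16)   # iterating over bytes yields one int per byte
# For b'\xff\xfeh\x00' that is four bytes:
#   0xff 0xfe -> byte order mark (2 bytes)
#   0x68 0x00 -> 'h' stored as one 16-bit code unit (2 bytes)

def expected_len(text, unit_bytes, bom=True):
    """Hypothetical helper: byte count when each character needs
    exactly one code unit of unit_bytes bytes, plus an optional BOM."""
    return (unit_bytes if bom else 0) + unit_bytes * len(text)

lengths = {
    "UTF-8": len(s.encode("UTF-8")),    # 6 one-byte units, no BOM
    "UTF-16": len(s.encode("UTF-16")),  # 2-byte BOM + 6 two-byte units
    "UTF-32": len(s.encode("UTF-32")),  # 4-byte BOM + 6 four-byte units
}
print(byte_values)
print(lengths)
```

So len() on a bytes object is simply counting bytes: 2 + 2 = 4 for the
single 'h' in UTF-16, and 2 + 6*2 = 14 and 4 + 6*4 = 28 for 'Hello!' in
UTF-16 and UTF-32.  Characters outside this simple case (non-ASCII in
UTF-8, or code points above U+FFFF in UTF-16) take more than one unit,
so the helper would undercount them.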



-- 
boB

