byte count unicode string

John Machin sjmachin at lexicon.net
Wed Sep 20 03:39:34 EDT 2006


willie wrote:
> Marc 'BlackJack' Rintsch:
>
>  >In <mailman.313.1158732191.10491.python-l... at python.org>, willie wrote:
>  >> # What's the correct way to get the
>  >> # byte count of a unicode (UTF-8) string?
>  >> # I couldn't find a builtin method
>  >> # and the following is memory inefficient.
>
>  >> ustr = "example\xC2\x9D".decode('UTF-8')
>
>  >> num_chars = len(ustr)    # 8
>
>  >> buf = ustr.encode('UTF-8')
>
>  >> num_bytes = len(buf)     # 9
>
>  >That is the correct way.
>
>
> # Apologies if I'm being dense, but it seems
> # unusual that I'd have to make a copy of a
> # unicode string, converting it into a byte
> # string, before I can determine the size (in bytes)
> # of the unicode string. Can someone provide the rationale
> # for that or correct my misunderstanding?
>

You initially asked "What's the correct way to get the byte count of a
unicode (UTF-8) string".

It appears you meant "How can I find how many bytes there are in the
UTF-8 representation of a Unicode string without manifesting the UTF-8
representation?".

The answer is, "You can't, at least not with a builtin", and the
rationale would have to be that nobody thought of a use case for
counting the length of the UTF-8 form but not creating the UTF-8 form.
What is your use case?
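
If the temporary copy really matters, the count can be worked out by
hand, since UTF-8 uses 1 byte for code points below U+0080, 2 below
U+0800, 3 below U+10000, and 4 otherwise. A minimal sketch, assuming
present-day Python 3 (where str is the Unicode type); utf8_len is a
hypothetical helper name, not a library function:

```python
def utf8_len(text):
    """Count the bytes UTF-8 would need for text, without encoding it.

    Hypothetical helper for illustration; assumes Python 3, where
    iterating over a str yields whole code points.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # ASCII range: 1 byte
            total += 1
        elif cp < 0x800:     # up to U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:   # rest of the Basic Multilingual Plane: 3 bytes
            total += 3
        else:                # supplementary planes: 4 bytes
            total += 4
    return total


ustr = "example\u009d"          # same text as the example in the question
num_chars = len(ustr)           # number of code points
num_bytes = utf8_len(ustr)      # UTF-8 size, no encoded copy created
```

In pure Python this loop is likely slower than
len(ustr.encode('UTF-8')), so it saves only the temporary copy, not
time. It also counts a lone surrogate as 3 bytes, where a strict encode
would raise an error instead.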

Cheers,
John



