byte count unicode string

willie willie at jamots.com
Wed Sep 20 04:12:33 EDT 2006


 >willie wrote:
 >> Marc 'BlackJack' Rintsch:
 >>
 >>  >In <mailman.313.1158732191.10491.python-l... at python.org>, willie wrote:
 >>  >> # What's the correct way to get the
 >>  >> # byte count of a unicode (UTF-8) string?
 >>  >> # I couldn't find a builtin method
 >>  >> # and the following is memory inefficient.

 >>  >> ustr = "example\xC2\x9D".decode('UTF-8')

 >>  >> num_chars = len(ustr)    # 8

 >>  >> buf = ustr.encode('UTF-8')

 >>  >> num_bytes = len(buf)     # 9

 >>  >That is the correct way.

 >> # Apologies if I'm being dense, but it seems
 >> # unusual that I'd have to make a copy of a
 >> # unicode string, converting it into a byte
 >> # string, before I can determine the size (in bytes)
 >> # of the unicode string. Can someone provide the rationale
 >> # for that or correct my misunderstanding?

 >You initially asked "What's the correct way to get the byte count of a
 >unicode (UTF-8) string".
 >
 >It appears you meant "How can I find how many bytes there are in the
 >UTF-8 representation of a Unicode string without manifesting the UTF-8
 >representation?".
 >
 >The answer is, "You can't", and the rationale would have to be that
 >nobody thought of a use case for counting the length of the UTF-8 form
 >but not creating the UTF-8 form. What is your use case?

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.

name = post.input('name') # utf-8 string

# preferable (bytes() here is a hypothetical builtin
# that would return the UTF-8 byte count)
if bytes(name) > 50:
	send_http_headers()
	display_page_begin()
	display_error_msg('the name is too long')
	display_form(name)
	display_page_end()

# If I have a form with many input elements,
# I have to convert each to a byte string
# before I can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

buf = name.encode('UTF-8')
num_bytes = len(buf)
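
# For what it's worth, the count can be worked out without
# building the encoded copy, because UTF-8 uses a fixed number
# of bytes for each code point range. A minimal sketch, not a
# builtin: utf8_len and fits_column are hypothetical names, and
# it assumes Python 3 str (on a narrow Python 2 build,
# surrogate pairs would need extra handling).

```python
def utf8_len(text):
    """Return the number of bytes UTF-8 needs for text, without encoding it."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # ASCII: 1 byte
            total += 1
        elif cp < 0x800:     # up to U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:   # rest of the BMP: 3 bytes
            total += 3
        else:                # supplementary planes: 4 bytes
            total += 4
    return total


def fits_column(text, max_bytes):
    """Hypothetical helper: True if text fits a column limited to max_bytes."""
    return utf8_len(text) <= max_bytes


name = "example\u009d"   # the same 8-character string as earlier in the thread
print(len(name), utf8_len(name), fits_column(name, 50))
```

# The loop avoids the duplicate byte string, but it is likely
# slower than encode() plus len(), so it only pays off when the
# copy itself is the problem.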


# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.


