byte count unicode string

Wed Sep 20 05:15:31 EDT 2006

willie wrote:
> >willie wrote:
>  >> Marc 'BlackJack' Rintsch:
>  >>
>  >>  >In <mailman.313.1158732191.10491.python-l... at python.org>, willie wrote:
>  >>  >> # What's the correct way to get the
>  >>  >> # byte count of a unicode (UTF-8) string?
>  >>  >> # I couldn't find a builtin method
>  >>  >> # and the following is memory inefficient.
>
>  >>  >> ustr = "example\xC2\x9D".decode('UTF-8')
>
>  >>  >> num_chars = len(ustr)    # 8
>
>  >>  >> buf = ustr.encode('UTF-8')
>
>  >>  >> num_bytes = len(buf)     # 9
>
>  >>  >That is the correct way.
>
>  >> # Apologies if I'm being dense, but it seems
>  >> # unusual that I'd have to make a copy of a
>  >> # unicode string, converting it into a byte
>  >> # string, before I can determine the size (in bytes)
>  >> # of the unicode string. Can someone provide the rationale
>  >> # for that or correct my misunderstanding?
>
>  >You initially asked "What's the correct way to get the byte count of a
>  >unicode (UTF-8) string".
>  >
>  >It appears you meant "How can I find how many bytes there are in the
>  >UTF-8 representation of a Unicode string without manifesting the UTF-8
>  >representation?".
>  >
>  >The answer is, "You can't", and the rationale would have to be that
>  >nobody thought of a use case for counting the length of the UTF-8 form
>  >but not creating the UTF-8 form. What is your use case?
>
> # Sorry for the confusion. My use case is a web app that
> # only deals with UTF-8 strings. I want to prevent silent
> # truncation of the data, so I want to validate the number
> # of bytes that make up the unicode string before sending
> # it to the database to be written.
>
> # For instance, say I have a name column that is varchar(50).
> # The 50 is in bytes not characters. So I can't use the length of
> # the unicode string to check if it's over the maximum allowed bytes.

What is the database API expecting to get as an arg: a Python unicode
object, or a Python str (8-bit, presumably encoded in utf-8)?

>
> name = post.input('name') # utf-8 string

You are confusing the hell out of yourself. You say that your web app
deals only with UTF-8 strings. Where do you get "the unicode string"
from??? If name is a utf-8 string, as your comment says, then len(name)
is all you need!!!

*PLEASE* print type(name), repr(name) so that we can see what type it
is!!
If it says the type is str, then it's an 8-bit string, (presumably)
encoded in utf-8.
If it says the type is unicode, then please explain "web app that only
deals with UTF-8 strings" ...
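
For what it's worth, here is a rough sketch of that check. The helper
name describe() is made up for illustration, and the sketch assumes an
interpreter where `bytes` names the 8-bit string type (on older
versions you would test against str instead):

```python
def describe(value):
    """Return (type name, repr, size in bytes as UTF-8) for a form value.

    Hypothetical helper, not part of any library.
    """
    if isinstance(value, bytes):
        # Already an 8-bit string: len() is the byte count.
        size = len(value)
    else:
        # Text (unicode) object: len() counts characters, so encode first.
        size = len(value.encode('utf-8'))
    return type(value).__name__, repr(value), size


raw = b"example\xc2\x9d"        # 8-bit string holding UTF-8 data
text = raw.decode('utf-8')      # the same data as a unicode object

print(describe(raw))
print(describe(text))
```

Both calls should report the same byte count. Only len(text) differs
from it, because that counts characters rather than bytes.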

>
> # preferable
> if bytes(name) > 50:
> 	send_http_headers()
> 	display_page_begin()
> 	display_error_msg('the name is too long')
> 	display_form(name)
> 	display_page_end()
>
> # If I have a form with many input elements,
> # I have to convert each to a byte string
> # before I can see how many bytes make up the
> # unicode string. That's very memory inefficient
> # with large text fields - having to duplicate each
> # one to get its size in bytes:

They'd be garbage collected unless you worked very hard to hang on to
them. How large is "large"?
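
A minimal sketch of the length check, assuming name really is a unicode
object and the column limit is counted in bytes. The names
MAX_NAME_BYTES and too_long() are invented for this example:

```python
MAX_NAME_BYTES = 50  # assumed limit, taken from the varchar(50) example


def too_long(text, limit=MAX_NAME_BYTES):
    """Return True if text needs more than `limit` bytes as UTF-8."""
    # The encoded copy is only a temporary: nothing keeps a reference to
    # it, so CPython can release it as soon as len() has returned.
    return len(text.encode('utf-8')) > limit


print(too_long(u"a" * 50))      # 50 one-byte characters
print(too_long(u"\x9d" * 26))   # 26 characters, two bytes each
```

So at any moment there should be at most one encoded copy alive, no
matter how many form fields get checked one after another.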