how to get size of unicode string/string in bytes ?

Walter Dörwald walter at livinglogic.de
Wed Aug 2 12:48:14 EDT 2006


Diez B. Roggisch wrote:
>> So then the easiest thing to do is: take the maximum length of a unicode
>> string you could possibly want to store, multiply it by 4 and make that
>> the length of the DB field.
>  
>> However, I'm pretty convinced it is a bad idea to store Python unicode
>> strings directly in a DB, especially as they are not portable. I assume
>> that some DB connectors honour the local platform encoding already, but
>> I'd still say that UTF-8 is your best friend here.
> 
> It was your assumption that the OP wanted to store the "real"
> unicode-strings. A moot point anyway, as it is AFAIK not possible to get
> their contents in byte form (except from a C-extension).

It is possible:

>>> u"a\xff\uffff\U0010ffff".encode("unicode-internal")
'a\x00\xff\x00\xff\xff\xff\xdb\xff\xdf'

This encoding is useless for interchange though, as you can't reliably
decode the result on another platform. (And it's probably not what the
OP intended.)
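
For illustration, even the length of that byte string isn't portable: on
a narrow (UCS-2) build like the one that produced the output above, the
string occupies 10 bytes (five UTF-16 code units at two bytes each),
while a wide (UCS-4) build would need 16 (four characters at four bytes
each):

>>> len(u"a\xff\uffff\U0010ffff".encode("unicode-internal"))
10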

> And assuming 4 bytes per character is a bit wasteful I'd say - especially
> when you have some > 80% ASCII subset in your text, as European and American
> languages do.

That would require UTF-32 as an encoding, which Python currently doesn't
have.

> The solution was given before: choose an encoding (utf-8 is certainly the
> most favorable one), and compute the byte-string length.

Exactly!
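
A minimal sketch of that approach (the sample string is just an
illustration): encode to UTF-8 and measure the resulting byte string
before storing it.

>>> s = u"D\xf6rwald"
>>> utf8 = s.encode("utf-8")
>>> len(s), len(utf8)
(7, 8)
>>> len(utf8) <= 4 * len(s)
True

For mostly-ASCII text the UTF-8 length stays close to the character
count, so the four-bytes-per-character reservation discussed above is a
safe, if loose, upper bound for the DB field.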

Servus,
   Walter


