chunking a long string?

Fri Nov 8 19:46:32 EST 2013

On Fri, 08 Nov 2013 12:43:43 -0800, wxjmfauth wrote:

> "(say, 1 kbyte each)": one "kilo" of characters or bytes?
> 
> Glad to read some users are still living in an ascii world, at the
> "Unicode time" where an encoded code point size may vary between 1-4
> bytes.
> 
> 
> Oops, sorry, I'm wrong, 

That part is true.

> it can be much more.

That part is false. You're measuring the overhead of the object 
structure, not the per-character storage. This has been the case since 
at least Python 2.2: strings are objects, and objects have overhead.

> >>> sys.getsizeof('ab')
> 27

27 bytes for two characters! Except it isn't: it's actually 25 bytes for 
the object header plus one byte for each of the two characters.

> >>> sys.getsizeof('a\U0001d11e')
> 48

And here you have four bytes for each of the two characters, plus a 
40-byte header. Observe:

py> import sys
py> c = '\U0001d11e'
py> len(c)
1
py> sys.getsizeof(2*c) - sys.getsizeof(c)
4
py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
4
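
The same subtraction trick can be wrapped in a small helper. This is only 
a rough sketch, assuming CPython 3.3 or later (where the string storage 
width depends on the widest character in the string); bytes_per_char is a 
name I made up for illustration, not anything in the standard library:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for strings made of `ch`.

    Hypothetical helper: the fixed object header is the same for both
    strings, so it cancels out in the subtraction and only the cost of
    one extra character remains.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# ASCII, Latin-1, BMP and astral (non-BMP) examples.
for ch in ('a', '\xe9', '\u20ac', '\U0001d11e'):
    print(ascii(ch), bytes_per_char(ch))
```

On such a build I would expect 1, 1, 2 and 4 bytes respectively, whatever 
the header size happens to be on your platform.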

How big is the object overhead on a (say) thousand-character string? Just 
one percent:

py> (sys.getsizeof(1000*c) - 4000)/4000
0.01
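
To see how quickly the fixed overhead fades as the string grows, something 
like the following should do. Again just a sketch, assuming CPython 3; 
overhead_fraction is a hypothetical name, and the exact figures depend on 
your Python version and on whether the build is 32-bit or 64-bit:

```python
import sys

def overhead_fraction(ch, n):
    """Fixed overhead of an n-character string, relative to its payload.

    Hypothetical helper: the per-character cost is measured by
    differencing two sizes, and whatever is left over after subtracting
    the payload is counted as overhead.
    """
    s = ch * n
    per_char = sys.getsizeof(ch * (n + 1)) - sys.getsizeof(s)
    payload = per_char * n
    return (sys.getsizeof(s) - payload) / payload

# The overhead is a constant number of bytes, so its share shrinks with n.
for n in (1, 10, 100, 1000, 10000):
    print(n, round(overhead_fraction('\U0001d11e', n), 4))
```

The absolute overhead stays constant, so the fraction falls roughly as 1/n.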

-- 
Steven