chunking a long string?

wxjmfauth at gmail.com
Sat Nov 9 03:14:20 EST 2013


On Saturday, 9 November 2013 01:46:32 UTC+1, Steven D'Aprano wrote:
> On Fri, 08 Nov 2013 12:43:43 -0800, wxjmfauth wrote:
>
> > "(say, 1 kbyte each)": one "kilo" of characters or bytes?
> >
> > Glad to read some users are still living in an ascii world, at the
> > "Unicode time" where an encoded code point size may vary between 1-4
> > bytes.
> >
> > Oops, sorry, I'm wrong,
>
> That part is true.
>
> > it can be much more.
>
> That part is false. You're measuring the overhead of the object
> structure, not the per-character storage. This has been the case going
> back since at least Python 2.2: strings are objects, and have overhead.
>
> > >>> sys.getsizeof('ab')
> > 27
>
> 27 bytes for two characters! Except it isn't, it's actually 25 bytes for
> the object header and two bytes for the two characters.
>
> > >>> sys.getsizeof('a\U0001d11e')
> > 48
>
> And here you have four bytes each for the two characters and a 40 byte
> header. Observe:
>
> py> c = '\U0001d11e'
> py> len(c)
> 1
> py> sys.getsizeof(2*c) - sys.getsizeof(c)
> 4
> py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
> 4
>
> How big is the object overhead on a (say) thousand character string? Just
> one percent:
>
> py> (sys.getsizeof(1000*c) - 4000)/4000
> 0.01


--------

Sure, the new phone "xyz" does not cost 600$, it costs
only 100$ more than the "abc" 500$ phone model.


If you wish to count the frequency of chars in a text
and store the results in a dict, {char: number_of_that_char, ...},
do not forget to store the keys in utf-XXX, it saves memory.
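
A minimal sketch of such a frequency count, assuming CPython 3.3 or
later. The sample text and the names freq, str_key_bytes and
utf32_key_bytes are made up for illustration. It sums sys.getsizeof
over the keys once as str objects and once as UTF-32 encoded bytes
objects; the numbers depend on the Python version and on a 32-bit or
64-bit build, so none are quoted here.

```python
import sys
from collections import Counter

# Hypothetical sample text mixing ASCII and one non-BMP character.
text = "a\U0001d11eb\U0001d11ea"

# {char: number_of_that_char, ...}
freq = Counter(text)

# Total size of the keys as str objects ...
str_key_bytes = sum(sys.getsizeof(ch) for ch in freq)
# ... and as UTF-32 encoded bytes objects (little-endian, no BOM).
utf32_key_bytes = sum(sys.getsizeof(ch.encode("utf-32-le")) for ch in freq)

print(dict(freq))
print("str keys:", str_key_bytes, "bytes keys:", utf32_key_bytes)
```

Note that 'utf-32' without the "-le" suffix prepends a 4-byte BOM to
every encoded key.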

After all, it is much more fun to waste one's time on coding
and on attempting to handle Unicode properly, and to watch
poor Python wasting its time on conversions.

>>> import sys
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U0001d11e')
44
>>> sys.getsizeof('\U0001d11e'.encode('utf-32'))
25
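
To separate the two quantities being argued about, here is a rough
sketch, assuming CPython 3.3 or later (the flexible string
representation). The helper names per_char_cost and fixed_overhead are
hypothetical. The first measures what one more character adds to a
long string, the second what is left of a one-character string once
that character's storage is subtracted.

```python
import sys

def per_char_cost(ch, n=1000):
    """Bytes added by one more copy of ch to an n-character string."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

def fixed_overhead(ch):
    """Size of the one-character string minus that character's storage."""
    return sys.getsizeof(ch) - per_char_cost(ch)

# One sample character per storage width: ASCII, Latin-1, BMP, non-BMP.
for ch in ('a', '\xe9', '\u20ac', '\U0001d11e'):
    print(ascii(ch), sys.getsizeof(ch), per_char_cost(ch), fixed_overhead(ch))
```

For one-character strings, as used for the dict keys above, the fixed
part is nearly all of the total; for long strings it is the
per-character part that counts.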


Hint: If you attempt to do the same exercise with
words in a "latin" text, never forget that the average length
of a word is approximately 1000 chars.

jmf




