chunking a long string?

Chris Angelico rosuav at gmail.com
Sat Nov 9 03:26:20 EST 2013


On Sat, Nov 9, 2013 at 7:14 PM,  <wxjmfauth at gmail.com> wrote:
> If you wish to count the frequency of chars in a text
> and store the results in a dict, {char: number_of_that_char, ...},
> do not forget to save the key in utf-XXX, it saves memory.

Oh, if you're that concerned about memory usage of individual
characters, try storing them as integers:

>>> import sys
>>> sys.getsizeof("a")
26
>>> sys.getsizeof("a".encode("utf-32"))
25
>>> sys.getsizeof("a".encode("utf-8"))
18
>>> sys.getsizeof(ord("a"))
14

I really don't see much advantage to UTF-32 here. UTF-8 happens to
win because I used an ASCII character, but the integer beats them
all, even for larger code points:
>>> sys.getsizeof(ord("\U0001d11e"))
16

And there's even less difference on my Linux box, but of course, you
never compare against Linux because Python 3.2 wide builds don't suit
your numbers.

For longer strings, there's an even more efficient way to store them.
Just store the memory address - that's going to be 4 bytes or 8,
depending on whether it's a 32-bit or 64-bit build of Python. There's
a name for this technique: interning. Some languages do it
automatically for all strings; others (like Python) do it only when
you ask. Suddenly it doesn't matter at all what the storage format is
- if two interned strings are equal, their addresses are the same,
and conversely. That's how to make comparison cheap.
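
A minimal sketch with CPython's sys.intern - two equal literals typed
on separate lines aren't guaranteed to be shared, which is exactly
why you ask for interning explicitly:

>>> import sys
>>> a = sys.intern("some longer string, not a simple identifier")
>>> b = sys.intern("some longer string, not a simple identifier")
>>> a is b
True

Once both are interned, equality testing is just a pointer
comparison.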

> Hint: If you attempt to do the same exercise with
> words in a "latin" text, never forget the average length
> of a word is approximately 1000 chars.

I think you're confusing the length of a word with the value of a
picture.

ChrisA
