Jeremy Hylton : weblog : 2004-01-15

Vector Resizing Algorithm

Thursday, January 15, 2004

Andrew Koenig explains why the vector class in the C++ STL grows by a factor of 1.5 when it is resized. (Link noted on Amit Patel's blog.) It took me a while to unpack the argument.

The goal is to keep the resizing factor less than (1 + sqrt(5)) / 2 -- the golden ratio. Koenig writes:

Suppose you are using a first-fit memory allocator, and you're progressively appending to a vector. Then each time you reallocate, you allocate new memory, copy the elements, then free the old memory. That leaves a gap, and it would be nice to be able to use that memory eventually. If the vector grows too rapidly, it will always be too big for the available memory.

The last sentence didn't make sense: How would the rate of change of the vector's size affect available memory? After reading the rest of the thread on comp.lang.c++.moderated, I understood that available memory meant unused memory managed by the allocator, rather than system memory that might be allocated by sbrk().

The logic seems to be that if you resize three times, on the third time the free space left by the last two copies of the vector is large enough to hold the newly resized vector. With growth factor k and an initial block of size s, the last two copies occupy s + k*s, which can hold the new block of size k^2 * s only if k^2 <= k + 1, and that holds exactly when k is no larger than the golden ratio. This argument assumes that a realloc() of the vector actually uses all new memory, that the memory is allocated contiguously, and that other allocations don't consume too much memory.
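To convince myself, I wrote a small simulation -- my own model, not Koenig's code -- of a first-fit allocator in which each discarded copy of the vector coalesces into a single hole at the low end of memory. With a factor of 1.5 the hole eventually grows large enough to satisfy a later request; with a factor of 2 the request always stays ahead of the hole.

    def first_fit_reuse(factor, start=16, steps=10):
        size = start
        hole = 0  # contiguous free space left by earlier copies of the vector
        for step in range(steps):
            new_size = int(size * factor)
            if new_size <= hole:
                return step   # the new block fits in previously freed space
            hole += size      # the old block is freed and joins the hole
            size = new_size
        return None           # never fit within the simulated number of resizes

    print(first_fit_reuse(1.5))  # fits after a handful of resizes
    print(first_fit_reuse(2.0))  # None -- the request always outgrows the hole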

I wonder if there are any first-fit allocators in common use. Doug Lea's malloc is a best-fit allocator. He notes: "As shown by Wilson et al, best-fit schemes (of various kinds and approximations) tend to produce the least fragmentation on real loads compared to other general approaches such as first-fit." I suppose it doesn't hurt to choose a constant that would work well with this hypothetical first-fit allocator, given that you have to pick some constant.

P.J. Plauger offered a slightly different reason: "VC++ V7.0 also uses 1.5. We settled on that factor as a better balance between memory utilization and expansion efficiency." It's harder to argue with that reason.

How about resizing algorithms in Python?

The amount by which a list object is over-allocated when it is resized depends on its size. For a list with n elements, when n < 2 ** (5 + 3 * i), the list size is rounded up to the nearest multiple of 2 ** (3 * i). A detailed comment in listobject.c explains:

This over-allocates proportional to the list size, making room for additional growth. The over-allocation is mild, but is enough to give linear-time amortized behavior over a long sequence of appends() in the presence of a poorly-performing system realloc() (which is a reality, e.g., across all flavors of Windows, with Win9x behavior being particularly bad -- and we've still got address space fragmentation problems on Win9x even with this scheme, although it requires much longer lists to provoke them than it used to).
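In case the rounding rule is easier to see as code, here is a rough Python rendering of the bit-twiddling in roundupsize() -- my sketch of the C function, not a copy of it:

    def roundupsize(n):
        # Find i such that n < 2 ** (5 + 3 * i), then round up past n
        # to a multiple of 2 ** (3 * i).
        nbits = 0
        n2 = n >> 5
        while True:
            n2 >>= 3
            nbits += 3
            if not n2:
                break
        return ((n >> nbits) + 1) << nbits

    print(roundupsize(10))   # 16, a multiple of 8 (10 < 256)
    print(roundupsize(300))  # 320, a multiple of 64 (256 <= 300 < 2048)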

A dictionary is resized whenever its fill count reaches two-thirds of its size. (Note that the fill count is not the same as the number of keys in the dictionary, because a dictionary may contain internal dummy entries, generated when a key is deleted.) A dictionary quadruples in size when it is resized, unless it has more than 50000 items, in which case it only doubles. The justification is that a sparse dictionary reduces the number of collisions and that quadrupling limits the number of resizes that occur as a dictionary grows. A dictionary resize is expensive -- much more expensive than a list resize -- because it creates a new hash table and inserts each non-dummy item from the old table into the new table.
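For reference, here is a back-of-the-envelope sketch of that policy as I read dictobject.c -- a paraphrase, not the C source: the table is resized when the fill count passes two-thirds of the table size, and the new table is the smallest power of two larger than four times (or, past 50000 items, twice) the number of used slots.

    MINSIZE = 8  # PyDict_MINSIZE

    def needs_resize(fill, table_size):
        # fill counts live keys plus the dummy entries left by deletions
        return fill * 3 >= table_size * 2

    def new_table_size(used):
        factor = 2 if used > 50000 else 4
        target = used * factor
        size = MINSIZE
        while size <= target:
            size <<= 1
        return size

    print(new_table_size(6))       # 32 slots: a small dict quadruples and rounds up
    print(new_table_size(100000))  # 262144 slots: a big dict only doubles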