slowdown with massive memory usage

Sun Aug 1 18:24:52 EDT 2004

On 01 Aug 2004 22:08:14 +0200, Hallvard B Furuseth <h.b.furuseth at usit.uio.no> wrote:

>Andrew MacIntyre wrote:
>> On Sat, 30 Jul 2004, Hallvard B Furuseth wrote:
>>> I have a program which starts by reading a lot of data into various
>>> dicts.
>>>
>>> When I moved a function to create one such dict from near the beginning
>>> of the program to a later time, that function slowed down by a factor
>>> of 8-14:
>> (...)
>>
>> Python 2.2 didn't use PyMalloc by default.  This leaves Python at the
>> mercy of the platform malloc()/realloc()/free(), and Python has found
>> rough spots with nearly every platform's implementation of these - which
>> is why PyMalloc was written.
>> 
>> While it isn't certain that this is your problem, if you can rebuild your
>> Python interpreter to include PyMalloc (--with-pymalloc I think), you can
>> find out.
>
>Thanks.  I'll check that when I get time.  Until then, malloc gets the
>blame until proven innocent, since profiling and test output turned out
>nothing else that was different.  (See my reply to Istvan.)
>
>> Be warned that there were some bugs in PyMalloc that were fixed before
>> Python 2.3 was released (when PyMalloc became a default option); as far as
>> I recall, these bugfixes were never backported to 2.2x.  So I wouldn't
>> recommend running a 2.2.3 PyMalloc enabled interpreter in production
>> without seriously testing all your production code.
>
>If PyMalloc helps, I'll push for an upgrade to 2.3.  Thanks again.
>
Speculating broadly here, but have you considered possible cache effects?
I.e., instructions execute faster when they and their operands can be fetched
from the CPU cache, and similarly L2 cache is faster than RAM. What is in
the caches at any point depends on what has recently been executed and on
how different memory areas map into the cache, and that will probably depend
on where you have put things in your program and in what order you call for its
execution. (The OS kernel may also affect the cache via interrupt service routines
and/or multitasking, e.g., for downloading or playing music in the background,
which I doubt you did ;-) ) If you have multiple CPUs, they do better if they
don't work on each other's jobs too much, since moving a job from one CPU to
another tends to mess up caching. I think some older kernels don't take that
into account, but maybe that's all history by now.

In a loop, typically the first time through will show cache-loading overhead, and
the rest will benefit, with blips for interrupts or interpreter special effects
such as extending an allocation pool or garbage collecting. These get washed out
in big averages, or filtered out in best-of timings, but they can be seen if you
create a graphic that shows every timing (e.g. a raster of dots colored by time
if there are a lot of timings). (Of course you have to watch out that your data
capture doesn't cause overhead that invalidates your results. It can be tricky.)
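
A minimal sketch of capturing every timing instead of an average, assuming a
modern Python 3 (time.perf_counter did not exist in the 2.2/2.3 interpreters
discussed in this thread). The names time_each_call and build_dict are made up
for illustration, and build_dict is only a stand-in workload. The result list
is preallocated so that the capture itself does not trigger list growth inside
the measured loop:

```python
import time


def time_each_call(func, repeats=1000):
    """Return a list holding the wall-clock duration of every single call."""
    timings = [0.0] * repeats   # preallocated: no list growth while measuring
    clock = time.perf_counter   # local name avoids repeated attribute lookups
    for i in range(repeats):
        start = clock()
        func()
        timings[i] = clock() - start
    return timings


def build_dict(n=10000):
    """Hypothetical workload: build one moderately large dict."""
    return {i: str(i) for i in range(n)}


if __name__ == "__main__":
    timings = time_each_call(build_dict, repeats=200)
    ordered = sorted(timings)
    # The first call and the worst call are exactly what averages hide.
    print("first call: %.6f s" % timings[0])
    print("best:       %.6f s" % ordered[0])
    print("median:     %.6f s" % ordered[len(ordered) // 2])
    print("worst:      %.6f s" % ordered[-1])
```

The raw timings list can then be fed to whatever plotting tool is at hand to
get the dot raster; the numbers will differ from machine to machine and from
run to run.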

Another effect that has shown up as a mystery culprit in the past is CPU heating and
consequent automatic slowing of the clock to prevent damage, but that doesn't
seem that likely in this case.

Another way you could lose time is if the timing of your code shifts relative to
code other than yours. Just speculating in general here, but suppose you are
processing data coming from a disk or other i/o that has some natural clumping
to it in the OS, such as waiting for an interrupt that says the next cluster read
is ready to fill buffers from. If for one arrangement of your code that interrupt
happened just before you executed and for the other arrangement just after, then
there would be a difference in interfering cache effects due to OS activity. Also,
if you do a succession of i/o operations that can't physically happen back to back,
then you should be able to gain by doing some computing in between. OS buffering
and disk caches mitigate this, but depending on your program you can empty or fill
them so that they demand physical i/o. Moving such code would
presumably have an effect on overall timing.
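
One rough way to check whether time is going to waiting rather than computing
is to compare wall-clock time with CPU time for the same call. This is only a
sketch, again assuming a modern Python 3; wall_and_cpu is a made-up helper
name, and time.sleep stands in for a blocking read. If wall time is much
larger than CPU time, the process spent most of the interval waiting (for
i/o, or for other processes), not executing your code:

```python
import time


def wall_and_cpu(func):
    """Call func() once; return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()   # CPU time charged to this process only
    result = func()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return result, wall_used, cpu_used


if __name__ == "__main__":
    # Compute-bound stand-in: wall and CPU time should be close.
    _, wall, cpu = wall_and_cpu(lambda: sum(range(2000000)))
    print("compute: wall %.4f s, cpu %.4f s" % (wall, cpu))

    # Waiting stand-in: wall time passes while almost no CPU time is used.
    _, wall, cpu = wall_and_cpu(lambda: time.sleep(0.2))
    print("waiting: wall %.4f s, cpu %.4f s" % (wall, cpu))
```

Running the same measurement with the function called early and called late
in the program would show whether the extra time is CPU work or waiting.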

Of course, if you are also doing multi-threaded stuff in your program, it's another ball game.
My USD.02

Regards,
Bengt Richter