Slow Ref cleanup in 2.2.1 with 100K+ objects on Linux. - or SAP DBAPI problem?

Thu May 22 19:49:31 EDT 2003

I've run into a strange situation and am looking for commentary. I'll have
done my own workaround by the time you get this, so I don't need an urgent
answer.

I've written a program to copy database records from Interbase to SAP. I
have to change foreign key values during this process by looking up parent
table records as child records are copied.

I'm copying one table at a time, but a child may reference several other
parent tables. I keep a cache using a dict of these other records.
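
For illustration only, the lookup-and-cache pattern described above could be
sketched roughly as follows. This is not the actual SQLCopy.py code: the names
ParentCache, remap_child and the lookup callable are hypothetical, and the
sketch assumes the cache maps (table, old key) to whatever the lookup returns.

```python
class ParentCache:
    """Unbounded cache of parent lookups, keyed by (table, old_key)."""

    def __init__(self, lookup):
        # lookup is a hypothetical callable that queries the target
        # database and returns the new key for a parent record.
        self._lookup = lookup
        self._cache = {}  # grows without limit, one entry per parent seen

    def new_key(self, table, old_key):
        cache_key = (table, old_key)
        if cache_key not in self._cache:
            self._cache[cache_key] = self._lookup(table, old_key)
        return self._cache[cache_key]


def remap_child(row, fk_columns, cache):
    """Return a copy of row with each foreign key replaced by its new value.

    fk_columns maps a column name in the child row to its parent table.
    """
    out = dict(row)
    for column, parent_table in fk_columns.items():
        out[column] = cache.new_key(parent_table, row[column])
    return out
```

With a cache like this, every distinct parent record touched during the copy
stays referenced until the cache itself goes away.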

So .. 230K+ records have been copied, and during this process the program
probably pulled in 200K+ other related parent records and is keeping all of
those in RAM (duh, bad design).

But interestingly, the copy subroutine has finished and is "returning" to
its parent, yet the process seems hung. I thought I had read somewhere
about a problem cleaning up refs (there aren't any cycles in this case),
but I can't recall exactly.
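
One rough way to check whether teardown of a large dict is the slow part would
be to time it in isolation. The sketch below is an assumption-laden test
harness, not a diagnosis: time_teardown and the fake record tuples are made up
for illustration, and it relies on CPython freeing each value as soon as its
last reference is dropped.

```python
import time


def time_teardown(n):
    """Build a dict of n small record-like tuples and time freeing it."""
    cache = {}
    for i in range(n):
        cache[i] = ("record", i, str(i))  # stand-in for a cached parent row
    start = time.time()
    # Dropping the only references makes CPython decref and free every
    # value right here, which touches each object's memory.
    cache.clear()
    return time.time() - start
```

Calling time_teardown(200000) on an otherwise idle machine gives a baseline;
if the objects have been swapped out, freeing them has to page each one back
in first, which would not show up in a test that fits in RAM.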

This is Python 2.2.1 on Red Hat 9.

Dual Xeon 2.2 GHz machine with 1 GB RAM.

Look at top:

  7:35pm  up 38 days,  9:07,  4 users,  load average: 6.37, 6.93, 6.68
198 processes: 173 sleeping, 1 running, 24 zombie, 0 stopped
CPU0 states:  3.3% user,  1.4% system,  0.0% nice, 94.1% idle
CPU1 states:  2.4% user,  3.1% system,  0.0% nice, 93.3% idle
CPU2 states:  0.4% user,  0.5% system,  0.0% nice, 98.1% idle
CPU3 states:  0.2% user,  0.4% system,  0.1% nice, 98.3% idle
Mem:  1030372K av, 1020480K used,    9892K free,       0K shrd,    2624K buff
Swap: 4096552K av, 1370228K used, 2726324K free                   21208K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
16173 bkc       10  -5  988M 762M  718M D <   0.3 75.7  38:00 python2 SQLCopy.py -e /tmp/copyerrors2.txt --copy STSouthTables STSouthTablesS Charges +

And vmstat

[bkc@strader /service]# vmstat 5
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  0 1370228   9668   2572  19964   0   0     2     1    0     1   1   1   1
 0  2  0 1370228   9692   2628  22872 816   0  1431   131  840  1423   2   2  96
 0 12  4 1370284  10192   2632  23652  55 1902   466  1927 1010   993   1   2  97
 0 17  4 1370284  10180   2632  23684  26 1466    32  1466  985   623   1   1  98
 0 17  4 1370680  10064   2364  23960  37 1468    38  1468  931   326   1   1  98
 0  3  0 1370604   9336   2100  31420 402 230  2690   314  921  1381   3   4  93

It looks to me like all the processes are deadlocked trying to swap in. I
reniced the python process a little, but that made no difference.

Note the load average is right up there, but the CPU time is 95% idle!

Anyone seen a condition like this before?

--

Hey, I tried to ^C it and I got:

    -11987 COMMUNIC sql03_catch_signal: caught signal 2

    -11987 COMMUNIC sql03_catch_signal: caught signal 2

I had to kill it. I wonder if it was blocked in SAP DBAPI and it's not a
cleanup problem.

Oh well, just musing.

I'll re-run this with a smarter cache design and see what happens.
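
One possible shape for such a cache is a size-bounded, least-recently-used
one, so memory stays flat no matter how many parent records are touched. This
is only a sketch under assumptions: BoundedCache, its lookup callable and the
max_entries default are hypothetical, and it uses collections.OrderedDict,
which exists in modern Python but not in 2.2.1.

```python
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache holding at most max_entries values."""

    def __init__(self, lookup, max_entries=10000):
        self._lookup = lookup            # hypothetical callable: key -> value
        self._max_entries = max_entries
        self._entries = OrderedDict()    # oldest entry first, newest last

    def get(self, key):
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        value = self._lookup(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value

    def __len__(self):
        return len(self._entries)
```

The trade-off is that an evicted parent costs another database lookup if a
later child references it, so max_entries would need tuning against how
clustered the foreign keys are.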

--
Novell DeveloperNet Sysop #5
