[medusa] Re: hideously long selects?

Skip Montanaro skip@m...
Fri, 3 Sep 1999 12:50:10 -0500


I'm baaack...


Jim Fulton asked:

> What version of Zope are you running? I beleave that this is due to a 
> bug that was fixed in the Zope 2.0 beta 5. 

I was running 1.something. I downloaded the 2.0.0 source today and
installed it, then modified my old start.py to work with the new ZServer.

I tried it out a bit on my development machine. Things were *real slow*,
but nothing was happening, so I applied strace (fast becoming one of my
favorite debugging tools) to both my database server and the ZServer
process. The results were instructive. The database server trace looks
like:

12:12:36 select(6, [5], [], [], {2, 500000}) = 0 (Timeout)
12:12:39 select(6, [5], [], [], {2, 500000}) = 1 (in [5], left {0, 280000})
12:12:41 accept(5, {sin_family=AF_INET, sin_port=htons(4321), sin_addr=inet_addr("127.0.0.1")}, [16]) = 12
... chunka chunka ...
12:12:41 read(13, "POST /RPC2 HTTP/1.0\r\nHost: loc"..., 1024) = 43
12:12:41 read(13, "User-Agent: xmlrpclib.py/0.9.8 ("..., 1024) = 56
12:12:41 read(13, "Content-Type: text/xml\r\nConten"..., 1024) = 170
... chunka chunka ...
12:12:41 write(14, "HTTP/1.0 200 OK\r\nServer: BaseH"..., 1024) = 1024
... bunch more writes ...
12:12:41 close(15) = 0
... a few munmaps ...
12:12:41 select(6, [5], [], [], {2, 500000}) = 0 (Timeout)
...

So, within the space of somewhat less than a second it accepts a connect,
handles the request (in this case an XML-RPC call requesting the top 50
cities and performers the server has gotten requests for), then returns to
its waiting state.

Now, lets have a look at the trace for the ZServer:

12:12:39 accept(5, {sin_family=AF_INET, sin_port=htons(4319), sin_addr=inet_addr("127.0.0.2")}, [16]) = 6
12:12:39 write(1, "info: adding channel <zhttp_chan"..., 72info: adding channel <zhttp_channel at 8268ae0 channel#: 0 requests:0>) = 72
12:12:39 fcntl(6, F_GETFL) = 0x2 (flags O_RDWR)
12:12:39 fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0
12:12:39 gettimeofday({936378759, 269643}, NULL) = 0
12:12:39 gettimeofday({936378759, 270168}, NULL) = 0
12:12:39 select(7, [5 6], [], [], {30, 0}) = 1 (in [6], left {30, 0})
12:12:39 recv(6, "GET /filter?file=index.shtml HTT"..., 4096, 0) = 277
12:12:39 gettimeofday({936378759, 273572}, NULL) = 0
12:12:39 brk(0x827b000) = 0x827b000
12:12:39 pipe([11, 12]) = 0
12:12:39 clone() = 8159
12:12:39 write(12, "4\24%@\4\0\0\0\201\0\0\0 \205$@`"..., 148) = 148
12:12:39 sigprocmask(SIG_SETMASK, NULL, [USR1]) = 0
12:12:39 write(12, "4\24%@\0\0\0\0\0\0\0\0T\370\7\10"..., 148) = 148
12:12:39 sigprocmask(SIG_SETMASK, NULL, [USR1]) = 0
12:12:39 sigsuspend([] <unfinished ...>
12:12:39 --- SIGUSR1 (User defined signal 1) ---
12:12:39 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call)
12:12:39 sigreturn() = ? (mask now [USR1])
12:12:39 kill(8160, SIGUSR1) = 0
12:12:39 kill(8160, SIGUSR1) = 0
12:12:39 sigprocmask(SIG_SETMASK, NULL, [USR1]) = 0
12:12:39 sigsuspend([] <unfinished ...>
12:12:39 --- SIGUSR1 (User defined signal 1) ---
12:12:39 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call)
12:12:39 sigreturn() = ? (mask now [USR1])
12:12:39 select(7, [5 6], [], [], {30, 0}) = 0 (Timeout)
12:13:09 select(7, [5 6], [6], [], {30, 0}) = 1 (out [6], left {30, 0})
12:13:09 send(6, "HTTP/1.0 200 OK\r\nServer: Zope/"..., 13297, 0) = 13297
12:13:09 select(7, [5 6], [6], [], {30, 0}) = 1 (out [6], left {30, 0})
12:13:09 gettimeofday({936378789, 284258}, NULL) = 0
12:13:09 write(4, "127.0.0.2:4319 - - [03/Sep/1999:"..., 98) = 98
12:13:09 write(1, "info: closing channel 6:<zhttp_c"..., 99info: closing channel 6:<zhttp_channel connected 127.0.0.2:4319 at 8268ae0 channel#: 0 requests:1>) = 99
12:13:09 close(6) = 0
12:13:09 close(-1) = -1 EBADF (Bad file descriptor)
12:13:09 select(6, [5], [], [], {30, 0}) = 0 (Timeout)

I assume the chit-chat with the backend database server is happening in a
separate thread, because I see no reads or writes around 12:12:41 in the
trace. At any rate, strace seems not to be able to attach to those threads, 
so their chit-chat is unavailable.

Note the back-to-back selects at 12:12:39 and 12:13:09. First it has fd 6
only in the read set and the select timed out after 30 seconds. Then it
added 6 to the write set for the second select which returned immediately,
and it sent the response back to Apache (which is serving as a proxy in this
case) and from there to the user.

Why'd it sit on that first select for 30 seconds? Any reason why fd 6
wasn't in the write set to begin with? I presume it is the socket which it
uses to talk to the backend server. (This would be in httplib or stuff it
calls.)

To make sure I was getting appropriate versions of important modules, I had
start.py print out sys.path and the locations of those modules I thought
most important when ZServer was started:

sys.path is (ordered first to last):
0: /usr/local/lib/python1.5/site-packages
1: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/../lib/python
2: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer
3: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/..
4: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer
5: /home/dolphin/skip/src/vtk-linux/lib
6: /usr/local/lib/python1.5/
7: /usr/local/lib/python1.5/plat-linux2
8: /usr/local/lib/python1.5/lib-tk
9: /usr/local/lib/python1.5/lib-dynload
10: /usr/local/lib/automatrix/python

Here are the locations of some key modules, as discovered by
imp.find_module:

asyncore is at /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/medusa/asyncore.py
asynchat is at /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/medusa/asynchat.py
xmlrpclib is at /usr/local/lib/python1.5/site-packages/xmlrpclib.py
sgmlop is builtin

Now, if I shorten up the timeout for asyncore.loop, the pregnant pause goes
away, but there's still this fd 6 in the read set followed by fd 6 in both
the read and write sets:

12:39:13 sigreturn() = ? (mask now [USR1])
12:39:13 select(7, [5 6], [], [], {0, 800000}) = 0 (Timeout)
12:39:14 select(7, [5 6], [6], [], {0, 800000}) = 1 (out [6], left {0, 800000})
12:39:14 send(6, "HTTP/1.0 200 OK\r\nServer: Zope/"..., 37112, 0) = 37112

I would agree that long timeouts are okay if we could just eliminate that
first select.

Am I hosed because I'm mixing blocking (httplib) and non-blocking (asyncore)
I/O? Should httplib be telling asyncore about its sockets?

Skip Montanaro | http://www.mojam.com/
skip@m... | http://www.musi-cal.com/~skip/
847-971-7098 | Python: Programming the way Guido indented...