tornado.web ioloop add_timeout eats CPU

Laszlo Nagy gandalf at shopzeus.com
Tue Sep 4 03:30:07 EDT 2012


> What's wrong is the 1,135,775 calls to "method 'poll' of
> 'select.epoll' objects".
I was afraid you were going to say that. :-)
> With five browsers waiting for messages over 845 seconds, that works
> out to each  waiting browser inducing 269 epolls per second.
>
> Almost equally important is what the problem is *not*. The problem is
> *not* spending the vast majority of time in epoll; that's *good* news.
> The problem is *not* that CPU load goes up linearly as we connect more
> clients. This is an efficiency problem, not a scaling problem.
>
> So what's the fix? I'm not a Tornado user; I don't have a patch.
> Obviously Laszlo's polling strategy is not performing, and the
> solution is to adopt the event-driven approach that epoll and Tornado
> do well.
Actually, I have found a way to overcome this problem, and it seems to 
be working. Instead of calling add_timeout from every request, I save 
the request objects in a list, and run a "message distributor" 
service in the background that routes messages to clients and finishes 
their long poll requests when needed. The main point is that the 
"message distributor" has a single entry point that is called back at 
fixed intervals, so the number of callbacks per second does not increase 
with the number of clients. Now the CPU load is about 1% with one 
client, and it is the same with 15 clients, while the response time 
stays the same (50-100 msec). That is efficient enough for me.
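In case it helps anyone else, here is a minimal stdlib-only sketch of the idea. The class and method names (MessageDistributor, register, tick, FakeClient) are illustrative, not real Tornado APIs; in Tornado the tick would be driven by something like ioloop.PeriodicCallback, and finish() would complete the long-poll response.

```python
# Sketch of the "message distributor" pattern: instead of scheduling
# one timeout per client, park waiting long-poll requests in a list
# and let a single periodic callback deliver pending messages to all
# of them at once. Hypothetical names; not Tornado's actual API.

class MessageDistributor:
    def __init__(self):
        self.waiting = []    # parked long-poll requests
        self.pending = []    # messages queued for delivery

    def register(self, client):
        """Park a long-poll request until a message arrives."""
        self.waiting.append(client)

    def send(self, message):
        """Queue a message; it is delivered on the next tick."""
        self.pending.append(message)

    def tick(self):
        """Single entry point, invoked at a fixed interval.

        The number of callbacks per second is constant regardless of
        how many clients are waiting, which is what keeps CPU flat.
        """
        if not self.pending or not self.waiting:
            return
        messages, self.pending = self.pending, []
        clients, self.waiting = self.waiting, []
        for client in clients:
            client.finish(messages)   # complete the long poll


class FakeClient:
    """Stand-in for a request handler, for illustration only."""
    def __init__(self):
        self.received = None

    def finish(self, messages):
        self.received = messages


dist = MessageDistributor()
a, b = FakeClient(), FakeClient()
dist.register(a)
dist.register(b)
dist.send("hello")
dist.tick()
print(a.received, b.received)   # both clients get the queued message
```

With a tick interval of, say, 50 msec, worst-case delivery latency is one interval plus the network round trip, which matches the 50-100 msec figure above.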

I understand that most people take a different approach: they make a fast 
poll request from the browser every 2 seconds or so. But this is not 
good for me, because then it can take up to 2 seconds to send a message 
from one browser to another, which is not acceptable in my case. 
Implementing long polls with a threaded server would be trivial, but a 
threaded server cannot handle 100+ simultaneous (long-running) requests, 
because that would require 100+ threads to be running.

This central "message distributor" concept seems to be working. I pay 
about 1-2% CPU overhead for being able to send messages from one 
browser to another within 100 msec, which is fine.

I could not have done this without your help.

Thank you!

    Laszlo
