CPython thread starvation

John Nagle nagle at animats.com
Fri Apr 27 23:35:19 EDT 2012


On 4/27/2012 6:25 PM, Adam Skutt wrote:
> On Apr 27, 2:54 pm, John Nagle<na... at animats.com>  wrote:
>>       I have a multi-threaded CPython program, which has up to four
>> threads.  One thread is simply a wait loop monitoring the other
>> three and waiting for them to finish, so it can give them more
>> work to do.  When the work threads, which read web pages and
>> then parse them, are compute-bound, I've had the monitoring thread
>> starved of CPU time for as long as 120 seconds.
>
> How exactly are you determining that this is the case?

    Found the problem.  The threads, after doing their compute-intensive
work of examining pages, stored some URLs they'd found.
The code that stored them looked them up with "getaddrinfo()", and
did this while holding a lock.  On CentOS, "getaddrinfo()" at the
glibc level doesn't always cache locally (ref
https://bugzilla.redhat.com/show_bug.cgi?id=576801).  Python
doesn't cache either.  So huge numbers of DNS requests were being
made.  For some pages being scanned, many of the domains required
accessing a rather slow DNS server.  The combination of thousands
of instances of the same domain, a slow DNS server, and no caching
slowed the crawler down severely.

    Added a local cache in the program to prevent this.
Performance much improved.
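
The post does not say how the cache was written, so the following is only
a sketch of one way to do it.  "DnsCache" and its "resolve" parameter are
hypothetical; the only real library call is socket.getaddrinfo().  Entries
never expire here, which is an assumption that suits a short crawl but not
a long-running service.

```python
import socket
import threading

class DnsCache:
    """Thread-safe in-process cache in front of getaddrinfo() (illustrative)."""

    def __init__(self, resolve=socket.getaddrinfo):
        self._resolve = resolve          # injectable so it can be tested offline
        self._lock = threading.Lock()
        self._cache = {}                 # hostname -> getaddrinfo result, or None

    def lookup(self, hostname):
        # Fast path: answer from the cache while holding the lock briefly.
        with self._lock:
            if hostname in self._cache:
                return self._cache[hostname]
        # Slow path: do the DNS query without holding the lock.
        try:
            result = self._resolve(hostname, None)
        except socket.gaierror:
            result = None                # cache failures too, to avoid re-querying
        with self._lock:
            # If another thread stored a result first, keep and return that one.
            return self._cache.setdefault(hostname, result)
```

Two threads that miss on the same name at the same moment may both query
DNS once; after that, every repeat of the domain is served from memory.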

				John Nagle


