CPython thread starvation

John Nagle nagle at animats.com
Sun Apr 29 13:26:43 EDT 2012


On 4/28/2012 1:04 PM, Paul Rubin wrote:
> Roy Smith <roy at panix.com> writes:
>> I agree that application-level name cacheing is "wrong", but sometimes
>> doing it the wrong way just makes sense.  I could whip up a simple
>> cacheing wrapper around getaddrinfo() in 5 minutes.  Depending on the
>> environment (both technology and bureaucracy), getting a cacheing
>> nameserver installed might take anywhere from 5 minutes to a few days to ...
>
> IMHO this really isn't one of those times.  The in-app wrapper would
> only be usable to just that process, and we already know that the OP has
> multiple processes running the same app on the same machine.  They would
> benefit from being able to share the cache, so now your wrapper gets
> more complicated.  If it's not a nameserver then it's something that
> fills in for one.  And then, since the application appears to be a large
> scale web spider, it probably wants to run on a cluster, and the cache
> should be shared across all the machines.  So you really probably want
> an industrial strength nameserver with a big persistent cache, and maybe
> a smaller local cache because of high locality when crawling specific
> sites, etc.

     Each process is analyzing one web site, and has its own cache.
Once the site is analyzed, which usually takes about a minute,
the cache disappears.  Multiple threads are reading multiple pages
from the web site during that time.
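A minimal sketch of such a per-process, thread-safe cache around socket.getaddrinfo() might look like the following (the function and variable names here are illustrative, not taken from the actual code):

```python
import socket
import threading

# Per-process cache; it disappears when the process exits, matching
# the one-site-per-process design described above.
_addr_cache = {}
_addr_lock = threading.Lock()

def cached_getaddrinfo(host, port=80):
    """Return getaddrinfo() results, memoized per (host, port)."""
    key = (host, port)
    with _addr_lock:
        if key in _addr_cache:
            return _addr_cache[key]
    # Do the (slow) lookup outside the lock so other threads
    # resolving different hosts are not blocked behind it.
    result = socket.getaddrinfo(host, port)
    with _addr_lock:
        # setdefault() keeps the first result if another thread
        # finished the same lookup while we were waiting.
        _addr_cache.setdefault(key, result)
        return _addr_cache[key]
```

This trades a possible duplicate lookup (two threads racing on the same host) for never holding the lock across a network call, which is the right trade for a spider with many worker threads.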

     A local cache is enough to eliminate the huge overhead of
doing a DNS lookup for every link found.  One site with a vast
number of links took over ten hours to analyze before this fix;
now it takes about four minutes.  That solved the problem.
We can probably get an additional minor performance boost with a real
local DNS daemon, and will probably configure one.
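If the daemon ends up being dnsmasq (an assumption; any caching resolver would do), a minimal configuration is just a few lines:

```
# /etc/dnsmasq.conf -- minimal local caching resolver (sketch)
listen-address=127.0.0.1    # answer queries from this machine only
cache-size=10000            # number of names to cache
server=8.8.8.8              # upstream resolver to forward misses to
```

The machine's /etc/resolv.conf then points at 127.0.0.1 so all lookups go through the local cache first.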

     We recently changed servers from Red Hat to CentOS, and management
from CPanel to Webmin.  Before the change, we had a local caching
DNS daemon, so we didn't have this problem.  Webmin's defaults
tend to be on the minimal side.

     The DNS information is used mostly to help decide whether two URLs
actually point to the same IP address, as part of deciding whether a
link is on-site or off-site.  Most of those links will never be read.
We're not crawling the entire site, just looking at likely pages to
find the name and address of the business behind the site.  (It's
part of our "Know who you're dealing with" system, SiteTruth.)
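That same-IP test can be sketched as follows (a hedged outline, not the actual SiteTruth code; the function names are mine):

```python
import socket
from urllib.parse import urlparse

def resolved_addresses(url):
    """Return the set of IP addresses the URL's host resolves to."""
    host = urlparse(url).hostname
    if host is None:
        return set()
    # Each getaddrinfo() entry is (family, type, proto, canonname,
    # sockaddr); sockaddr[0] is the IP address string.
    return {info[4][0] for info in socket.getaddrinfo(host, None)}

def same_site(url_a, url_b):
    """True if the two URLs share at least one resolved IP address."""
    return bool(resolved_addresses(url_a) & resolved_addresses(url_b))
```

With the lookups cached, calling this for every link on a page costs one real DNS query per distinct host rather than one per link.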
			
				John Nagle




More information about the Python-list mailing list