Python for Webscripting (like PHP)

Terry Reedy tjreedy at udel.edu
Thu Aug 18 17:55:33 EDT 2005


"Peter Hansen" <peter at engcorp.com> wrote in message 
news:yPednZ2dnZ3Eg1n1nZ2dnREBmd6dnZ2dRVn-z52dnZ0 at powergate.ca...
> Alessandro Bottoni wrote:
>> (Python is even said to be used by Yahoo! and Google, among others,
>> but nobody has been able to demonstrate this so far)
>
> Nobody, except Google's founders?
>
> http://www-db.stanford.edu/~backrub/google.html

I think the relevant paragraph is worth quoting here (asterisks added for 
emphasis):
"
In order to scale to hundreds of millions of web pages, Google has a fast 
distributed crawling system. A single URLserver serves lists of URLs to a 
number of crawlers (we typically ran about 3). Both the URLserver and the 
crawlers are implemented in **Python**. Each crawler keeps roughly 300 
connections open at once. This is necessary to retrieve web pages at a fast 
enough pace. At peak speeds, the system can crawl over 100 web pages per 
second using four crawlers. This amounts to roughly 600K per second of 
data. A major performance stress is DNS lookup. Each crawler maintains a 
its own DNS cache so it does not need to do a DNS lookup before crawling 
each document. Each of the hundreds of connections can be in a number of 
different states: looking up DNS, connecting to host, sending request, and 
receiving response. These factors make the crawler a complex component of 
the system. It uses asynchronous IO to manage events, and a number of 
queues to move page fetches from state to state.
"
This paper seems to date from about 1998. Of course, bottleneck code may 
since have been rewritten in C, but Google continues to hire Python 
programmers (among others).
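
For readers who want a feel for the design in the quoted paragraph, here is 
a minimal sketch of the same ideas: many fetches in flight at once, a DNS 
cache private to each crawler, and an explicit state per page. It is not 
Google's code. It assumes the modern asyncio module as a stand-in for 
whatever asynchronous IO the original crawler used, and the names Crawler, 
State, resolve and fetch are hypothetical. The resolve and fetch hooks are 
supplied by the caller, so the sketch runs without touching the network.

```python
import asyncio
from enum import Enum
from urllib.parse import urlsplit


class State(Enum):
    DNS = "looking up DNS"
    FETCH = "connecting, sending request, receiving response"
    DONE = "done"


class Crawler:
    """Toy crawler: private DNS cache, bounded concurrency, per-URL state."""

    def __init__(self, resolve, fetch, max_connections=300):
        self.resolve = resolve            # hypothetical hook: async host -> address
        self.fetch = fetch                # hypothetical hook: async (address, url) -> body
        self.max_connections = max_connections
        self.dns_cache = {}               # host -> lookup task, private to this crawler
        self.states = {}                  # url -> current State

    async def _lookup(self, host):
        if host not in self.dns_cache:
            # Cache the in-flight task so concurrent fetches share one lookup.
            self.dns_cache[host] = asyncio.create_task(self.resolve(host))
        return await self.dns_cache[host]

    async def _fetch_one(self, url, limit):
        async with limit:                 # at most max_connections fetches at once
            self.states[url] = State.DNS
            address = await self._lookup(urlsplit(url).hostname)
            # A real crawler would track connect/send/receive separately.
            self.states[url] = State.FETCH
            body = await self.fetch(address, url)
            self.states[url] = State.DONE
            return body

    async def crawl(self, urls):
        urls = list(urls)
        # Created here so the semaphore belongs to the running event loop.
        limit = asyncio.Semaphore(self.max_connections)
        bodies = await asyncio.gather(*(self._fetch_one(u, limit) for u in urls))
        return dict(zip(urls, bodies))


async def fake_resolve(host):
    """Stand-in for a DNS lookup; always returns a placeholder address."""
    await asyncio.sleep(0)
    return "10.0.0.1"


async def fake_fetch(address, url):
    """Stand-in for an HTTP fetch; returns a synthetic page body."""
    await asyncio.sleep(0)
    return "<html>%s via %s</html>" % (url, address)


if __name__ == "__main__":
    crawler = Crawler(fake_resolve, fake_fetch)
    pages = asyncio.run(crawler.crawl(["http://a.example/1", "http://a.example/2"]))
    print(len(pages), "pages fetched;", len(crawler.dns_cache), "DNS lookup(s)")
```

Caching the lookup task rather than the finished address means two pages 
on the same host that start at the same moment still trigger only one 
lookup, which is the point of the per-crawler cache in the quote.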

Terry J. Reedy





