Myth or Urban Legend? Python => Google [ was: Why learn Python ??]

EP EP at zomething.com
Tue Jan 13 21:38:45 EST 2004


Is it true that the original Google spider was written in Python?

I came across a paper on the web some time back that I saved and read just 
last night:


     The Anatomy of a Large-Scale Hypertextual Web Search Engine

     Sergey Brin and Lawrence Page
     {sergey, page}@cs.stanford.edu
     Computer Science Department, Stanford University, Stanford, CA 94305

A neat read, but I'm not sure of the authenticity of the paper: I could be 
gullible.  It would appear to be a paper written some years back on the 
genesis of the Google search engine.

[excerpt]
Running a web crawler is a challenging task. There are tricky performance 
and reliability issues and even more importantly, there are social issues. 
Crawling is the most fragile application since it involves interacting with 
hundreds of thousands of web servers and various name servers which are all 
beyond the control of the system.
In order to scale to hundreds of millions of web pages, Google has a fast 
distributed crawling system. A single URLserver serves lists of URLs to a 
number of crawlers (we typically ran about 3). Both the URLserver and the 
crawlers are implemented in Python. Each crawler keeps roughly 300 
connections open at once. This is necessary to retrieve web pages at a fast 
enough pace. At peak speeds, the system can crawl over 100 web pages per 
second using four crawlers. This amounts to roughly 600K per second of 
data. A major performance stress is DNS lookup. Each crawler maintains 
its own DNS cache so it does not need to do a DNS lookup before crawling 
each document. Each of the hundreds of connections can be in a number of 
different states: looking up DNS, connecting to host, sending request, and 
receiving response. These factors make the crawler a complex component of 
the system. It uses asynchronous IO to manage events, and a number of 
queues to move page fetches from state to state.
[/excerpt]
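
For anyone curious what that design looks like, here is a rough sketch in 
present-day Python of the three ideas in the excerpt: asynchronous IO, a cap 
on open connections, and a per-crawler DNS cache. It is only an illustration 
under my own assumptions, not the code the paper describes. It uses asyncio, 
which did not exist at the time, and the names crawl, fake_resolve and 
fake_fetch are hypothetical. The two fakes stand in for real DNS and HTTP so 
the sketch runs without a network, and the connect/send/receive states are 
collapsed into a single fetch call.

```python
import asyncio
from urllib.parse import urlsplit


async def crawl(urls, resolve, fetch, max_connections=300):
    """Fetch all urls concurrently; return (pages, number_of_dns_lookups).

    resolve(host) and fetch(address, url) are caller-supplied coroutines
    (hypothetical stand-ins for real DNS and HTTP clients).
    """
    dns_cache = {}   # host -> Task that resolves to an address
    pages = {}       # url -> response body
    lookups = 0
    # Cap the number of fetches in flight, like the ~300 open connections.
    limit = asyncio.Semaphore(max_connections)

    async def fetch_one(url):
        nonlocal lookups
        host = urlsplit(url).hostname
        async with limit:
            # State: looking up DNS (skipped when the host is cached).
            if host not in dns_cache:
                lookups += 1
                # Cache the Task itself so concurrent fetches for the
                # same host share one lookup instead of racing.
                dns_cache[host] = asyncio.ensure_future(resolve(host))
            address = await dns_cache[host]
            # States: connecting, sending request, receiving response.
            pages[url] = await fetch(address, url)

    await asyncio.gather(*(fetch_one(u) for u in urls))
    return pages, lookups


async def fake_resolve(host):
    """Pretend DNS lookup: yields to the event loop, returns a dummy address."""
    await asyncio.sleep(0)
    return "10.0.0." + str(len(host))


async def fake_fetch(address, url):
    """Pretend HTTP fetch: yields to the event loop, returns a dummy page."""
    await asyncio.sleep(0)
    return f"<html>{url} via {address}</html>"


if __name__ == "__main__":
    demo_urls = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
    ]
    pages, lookups = asyncio.run(crawl(demo_urls, fake_resolve, fake_fetch))
    print(len(pages), "pages fetched with", lookups, "DNS lookups")
```

Three URLs on two hosts need only two lookups, which is the point of the 
cache. As a sanity check on the excerpt's figures: 600K per second at 100 
pages per second works out to roughly 6K per page, and four crawlers at about 
300 connections each means on the order of 1200 connections open at once.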

If true, this would seem to be the poster-child example for using Python, at 
least in some respects.


Eric, Intrigued

"but at least I didn't top post"

