developing web spider

John Nagle nagle at animats.com
Wed Apr 2 12:01:15 EDT 2008


abeen wrote:
> Hello,
> 
> I would like to know which programming language would be best for
> developing a web spider.
> The more information about spiders, the better.

    As someone who actually runs a Python-based web spider in production, I
should comment.

    You need a very robust parser to parse real world HTML.
Even the stock version of BeautifulSoup isn't good enough.  We have a
modified version of BeautifulSoup, plus other library patches, just to
keep the parser from blowing up or swallowing the entire page into
a malformed comment or tag.  Browsers are incredibly forgiving in this
regard.
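
    To illustrate the defensive style this calls for, here is a minimal
sketch.  It is not the modified BeautifulSoup described above; it swaps in
the standard library's html.parser, and the names LinkExtractor,
extract_links and MAX_HTML_CHARS are made up for the example.  It caps the
input size and keeps whatever was collected if the parser raises.  It does
not by itself cure the swallowed-page problem.

```python
from html.parser import HTMLParser

# Hypothetical cap so one enormous page cannot tie up the parser.
MAX_HTML_CHARS = 1_000_000


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return (links, ok); ok is False if the parser raised partway through."""
    parser = LinkExtractor()
    try:
        parser.feed(html[:MAX_HTML_CHARS])
        parser.close()
    except Exception:
        # Keep the links gathered before the failure instead of losing the page.
        return parser.links, False
    return parser.links, True
```

    A caller would typically log the pages that come back with ok set to
False and carry on with the crawl.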

    "urllib" needs extra robustness, too.  The stock timeout mechanism
isn't good enough.  Some sites do weird things, like accept a TCP
connection for HTTP and then never send anything.
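
    One way to add an overall deadline is sketched below, written against
Python 3's urllib.request.  It is an illustration, not the patched
library mentioned above.  The names FetchTimeout, read_with_deadline and
fetch, and all the default limits, are assumptions chosen for the
example.  The timeout argument to urlopen only bounds each blocking socket
operation, so a server that trickles bytes can still hold a connection for
a long time; the loop adds a total time budget and a size cap on top.

```python
import time
import urllib.request


class FetchTimeout(Exception):
    """Raised when a download exceeds the overall deadline."""


def read_with_deadline(stream, deadline_s, max_bytes=2_000_000,
                       chunk_size=16_384, clock=time.monotonic):
    """Read from a binary stream, giving up after deadline_s seconds in total.

    The deadline is checked between reads, so it is approximate: a single
    read can still block for up to the socket timeout.
    """
    start = clock()
    chunks = []
    total = 0
    while total < max_bytes:
        if clock() - start > deadline_s:
            raise FetchTimeout("overall deadline exceeded")
        # read1() returns after at most one underlying read, so a slow
        # server cannot keep us inside one call while it fills a big buffer.
        chunk = stream.read1(min(chunk_size, max_bytes - total))
        if not chunk:
            break  # end of stream
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks)


def fetch(url, socket_timeout_s=10.0, deadline_s=30.0):
    """Fetch a URL with both a per-operation timeout and a total deadline."""
    with urllib.request.urlopen(url, timeout=socket_timeout_s) as resp:
        return read_with_deadline(resp, deadline_s)
```

    The clock parameter exists only so the deadline logic can be tested
without waiting; normal callers would leave it alone.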

    Python is on the slow side for this.  Python is about 60x
slower than C, and for this application, you definitely see that.
A Python-based spider will go compute-bound for seconds per page
on big pages.  The C-based parsers for XML/HTML aren't robust enough for
this application.  And then there's the Global Interpreter Lock: only one
thread runs Python bytecode at a time, so a multicore CPU won't help a
multithreaded compute-bound process.
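
    The usual workaround for the lock is to spread parsing across
processes rather than threads.  A rough sketch follows, assuming Python
3's concurrent.futures; the names LinkCounter, count_links and parse_all
are invented for the example, and counting links stands in for whatever
compute-bound parsing a real spider does.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1


def count_links(html):
    """Compute-bound stand-in for real page parsing."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.links


def parse_all(pages, executor):
    """Parse every page using the given concurrent.futures executor.

    With a ThreadPoolExecutor the work still serializes on the lock; with
    a ProcessPoolExecutor each worker process has its own interpreter and
    its own lock, so the pages can be parsed on several cores at once.
    """
    return list(executor.map(count_links, pages))
```

    In a real program the ProcessPoolExecutor would be created under an
"if __name__ == '__main__':" guard and passed to parse_all.  Each page
is pickled and copied to a worker, so this only pays off when parsing
costs much more than the copy.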

    I'd recommend using Java or C# for new work in this area
if you're doing this in volume.  Otherwise, you'll need to buy
many, many extra racks of servers.  In practice, the big spiders
are in C or C++.

> http://www.immavista.com

    Lose the ad link.

					John Nagle


