Is Python good for web crawlers?

Tim Parkin tim at pollenation.net
Tue Feb 7 12:35:57 EST 2006


Tempo wrote:

>Does a web crawler have to download an entire page if it only needs to
>check if the product is in stock on a page? Or if it just needs to
>search for one match of a certain word on a page?
>
Typically you would download the whole HTML file and then run your
analysis on it. It is possible to scan the stream of characters as they
come back from the server and stop at the first match, but on average
this would only cut the download time in half (presuming the item you
want is a single byte long and is equally likely to appear anywhere in
the HTML). In reality, unless the pages you are requesting are very
large (200 KB+) or your bandwidth is very expensive (in time and/or
capacity), it is probably easier to just download the whole file.
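
For the streaming approach mentioned above, here is a minimal sketch of
scanning a response in chunks and stopping at the first match. The
function name stream_contains, the chunk size and the return shape are
my own choices for illustration; it only assumes an object with a
read(n) method that returns bytes.

```python
import io


def stream_contains(stream, needle, chunk_size=8192):
    """Return (found, bytes_read), reading only until `needle` appears.

    `stream` is any object with a read(n) method that returns bytes.
    """
    tail = b""      # end of the previous window, kept to catch split matches
    bytes_read = 0
    keep = len(needle) - 1
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached without finding the needle.
            return False, bytes_read
        bytes_read += len(chunk)
        window = tail + chunk
        if needle in window:
            return True, bytes_read
        # Carry over just enough bytes for a match that straddles two chunks.
        tail = window[-keep:] if keep > 0 else b""


# Stand-in for a network response: a large page with the marker near the start.
page = b"<html>" + b"x" * 10000 + b"In stock" + b"y" * 50000 + b"</html>"
found, used = stream_contains(io.BytesIO(page), b"In stock", chunk_size=1024)
print(found, used, len(page))
```

With a real page you could pass the response object returned by
urllib.request.urlopen(url) in place of the BytesIO, since it also
supports read(n). Whether the saving is worth the extra code depends on
where the match tends to sit in the page.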

I would recommend that you use BeautifulSoup to parse badly formatted
HTML documents (which is most of the web). Google 'beautiful soup' and
you should find it easily.
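
As a rough illustration of the download-then-parse approach, the sketch
below checks a malformed page for a stock marker. It assumes the
current bs4 package (pip install beautifulsoup4) with the standard
library "html.parser" backend. The markup, the "availability" class
name and the is_in_stock helper are all made up for the example; a real
shop page will need its own selectors.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: unclosed <h1>, <p> and <b> tags, and no closing
# </body> or </html>. The structure and class names are hypothetical.
BAD_HTML = """<html><body>
<h1>Widget
<p class="availability"><b>In stock
<p class="price">9.99
"""


def is_in_stock(html):
    """Return True if the (hypothetical) availability paragraph says 'in stock'."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", class_="availability")
    return tag is not None and "in stock" in tag.get_text().lower()


print(is_in_stock(BAD_HTML))
```

The parser still builds a usable tree from the broken markup, so the
lookup works even though none of the tags are closed.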

Tim Parkin


