Web Crawler - Python or Perl?

Ray Cote rgacote at AppropriateSolutions.com
Mon Jun 9 15:48:41 EDT 2008


At 11:21 AM -0700 6/9/08, subeen wrote:
>On Jun 10, 12:15 am, Stefan Behnel <stefan... at behnel.de> wrote:
>>  subeen wrote:
>>  > You can use the urllib2 module and/or Beautiful Soup for developing a crawler
>>
>>  Not if you care about a) speed and/or b) memory efficiency.
>>
>  > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>>
>>  Stefan
>
>Yeah, Beautiful Soup is slower, so it's better to use urllib2 for
>fetching data and regular expressions for parsing the data.
>
>
>regards,
>Subeen.
>http://love-python.blogspot.com/
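
For what it's worth, that urllib2-plus-regex combination might look
roughly like this (the URL and the href pattern below are just
placeholders, not taken from anything we actually run):

import re
import urllib2

def fetch(url):
    # urllib2 does the HTTP plumbing and hands back the raw page source
    return urllib2.urlopen(url).read()

# naive href extraction -- fine on tidy pages, fragile on messy ones
link_re = re.compile(r'href="([^"]+)"')

page = fetch('http://example.com/')
for link in link_re.findall(page):
    print link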

Beautiful Soup is a bit slower, but it will actually parse some of
the bizarre HTML you'll download off the web. We've written a couple
of crawlers to run over specific clients' sites (I'll note that we
did _not_ create the content on those sites).

Expect to find HTML that looks like this:

<ul>
<li>
<form>
</li>
</form>
</ul>
[from a real example, and yes, it did indeed render in IE.]
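
Feed that snippet to Beautiful Soup and it still pulls the list item
out of the crossed tags. A minimal sketch, using the 3.x-style import:

from BeautifulSoup import BeautifulSoup

broken = """
<ul>
<li>
<form>
</li>
</form>
</ul>
"""

soup = BeautifulSoup(broken)
print soup.findAll('li')   # finds the <li> despite the broken nesting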

I don't know whether the quicker parsers discussed here require
well-formed HTML, since I've not used them. You may want to consider
using one of the quicker HTML parsers and, when it throws a fit on
the downloaded HTML, dropping back to Beautiful Soup -- which usually
gets _something_ useful off the page.
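
A rough sketch of that fallback, assuming lxml as the faster parser
(purely illustrative; the two parsers hand back different tree
objects, so whatever walks the result has to cope with both):

from lxml import etree, html as lxml_html
from BeautifulSoup import BeautifulSoup

def parse_page(raw):
    try:
        # fast path: lxml's quick HTML parser
        return lxml_html.fromstring(raw)
    except (etree.ParserError, etree.XMLSyntaxError):
        # slow path: Beautiful Soup usually gets _something_ useful
        return BeautifulSoup(raw)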

--Ray

-- 

Raymond Cote
Appropriate Solutions, Inc.
PO Box 458 ~ Peterborough, NH 03458-0458
Phone: 603.924.6079 ~ Fax: 603.924.8668
rgacote(at)AppropriateSolutions.com
www.AppropriateSolutions.com


