developing web spider

Kenji Noguchi tokyo246 at gmail.com
Fri Apr 4 13:29:19 EDT 2008


Attached is the essence of my crawler. It collects the <a> tags on a given URL.

HTML parsing is not a big deal, as "tidy" does it all for you: it converts
broken HTML into valid XHTML. From that point on there is a wealth of XML
libraries; just write whatever handlers you want, such as an <a> element
handler.
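For illustration, here is a minimal sketch of that approach (not the attached
crawler.py): it assumes the tidy command-line tool is on PATH, uses
xml.etree.ElementTree, and the function names and tidy flags are just my
illustration.

import subprocess
import urllib.request
import xml.etree.ElementTree as ET

XHTML_NS = '{http://www.w3.org/1999/xhtml}'

def fetch(url):
    # Download the raw (possibly broken) HTML.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def tidy_to_xhtml(html_bytes):
    # Pipe the page through the HTML Tidy command-line tool so broken HTML
    # comes out as well-formed XHTML. -numeric replaces named entities,
    # which a plain XML parser would otherwise choke on.
    proc = subprocess.run(
        ['tidy', '-q', '-asxhtml', '-numeric', '-utf8',
         '--show-warnings', 'no', '--force-output', 'yes'],
        input=html_bytes, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    return proc.stdout

def collect_links(url):
    # Parse the cleaned-up XHTML and pull out every <a href=...>.
    root = ET.fromstring(tidy_to_xhtml(fetch(url)))
    return [a.get('href') for a in root.iter(XHTML_NS + 'a') if a.get('href')]

if __name__ == '__main__':
    for href in collect_links('http://www.python.org/'):
        print(href)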

I've extended it to be multi-threaded, to limit the number of threads per
web host, to allow more flexible element handling, etc. By the way, SQLite
is nice for building the URL database; a rough sketch of both ideas follows.
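The attachment doesn't show those extensions, but roughly: the per-host limit
can be one semaphore per host, and the URL db a single SQLite table. The
table name, columns, and MAX_PER_HOST below are made up for illustration.

import sqlite3
import threading

MAX_PER_HOST = 2          # illustrative cap on simultaneous fetches per host

_slots_lock = threading.Lock()
_host_slots = {}          # host -> Semaphore

def host_slot(host):
    # Lazily create one semaphore per host; acquiring it before a fetch
    # keeps at most MAX_PER_HOST threads talking to the same server.
    with _slots_lock:
        if host not in _host_slots:
            _host_slots[host] = threading.Semaphore(MAX_PER_HOST)
        return _host_slots[host]

# A hypothetical URL db: the primary key doubles as the "seen before" check.
db = sqlite3.connect('urls.db', check_same_thread=False)
db_lock = threading.Lock()
db.execute('''CREATE TABLE IF NOT EXISTS urls (
                  url     TEXT PRIMARY KEY,
                  host    TEXT NOT NULL,
                  fetched INTEGER NOT NULL DEFAULT 0)''')

def enqueue(url, host):
    # Re-discovered URLs become a no-op thanks to INSERT OR IGNORE.
    with db_lock:
        db.execute('INSERT OR IGNORE INTO urls (url, host) VALUES (?, ?)',
                   (url, host))
        db.commit()

def crawl_one(url, host, fetch):
    # Respect the per-host limit, fetch, then mark the URL as done.
    with host_slot(host):
        page = fetch(url)
    with db_lock:
        db.execute('UPDATE urls SET fetched = 1 WHERE url = ?', (url,))
        db.commit()
    return page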

Kenji Noguchi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crawler.py
Type: text/x-python
Size: 2583 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20080404/38da3cb2/attachment-0001.py>

