developing web spider
Kenji Noguchi
tokyo246 at gmail.com
Fri Apr 4 13:29:19 EDT 2008
Attached is the essence of my crawler. It collects the <a> tags from a given URL.
HTML parsing is not a big deal, as "tidy" does it all for you: it converts
broken HTML into valid XHTML. From that point there's a wealth of XML
libraries; just write whatever handling you want, such as an <a> element
handler.
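The tidy-then-XML pipeline isn't reproducible without the tidy bindings, but the <a>-collecting step can be sketched with the standard library's lenient html.parser, which tolerates broken HTML directly (this is a stand-in for the tidy + XML approach, not the attached code):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep non-empty hrefs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
# Note the second link is sloppy HTML (unquoted href, unclosed <p>)
collector.feed('<p>See <a href="http://example.com/">here</a>'
               ' and <a href=/docs>docs</a>.')
print(collector.links)  # ['http://example.com/', '/docs']
```

The same handler idea carries over to any SAX-style or ElementTree walk once tidy has produced clean XHTML.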
I've extended it to be multi-threaded, to limit the number of threads per
web host, to allow more flexible element handling, and so on. SQLite is
nice for building the URL database, by the way.
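A minimal sketch of those two extensions: a SQLite table whose PRIMARY KEY deduplicates URLs, plus one BoundedSemaphore per host to cap concurrent fetches. The schema, the MAX_PER_HOST value, and the function names are all illustrative assumptions, not taken from the attachment:

```python
import sqlite3
import threading
from collections import defaultdict
from urllib.parse import urlparse

# URL database: the PRIMARY KEY makes duplicate inserts fail, which
# doubles as the "have we seen this URL?" check.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, fetched INTEGER DEFAULT 0)")
db_lock = threading.Lock()

def enqueue(url):
    """Record a URL; return True only the first time it is seen."""
    with db_lock:
        try:
            db.execute("INSERT INTO urls (url) VALUES (?)", (url,))
            db.commit()
            return True
        except sqlite3.IntegrityError:
            return False  # already queued

# One semaphore per host caps how many threads hit that host at once
# (MAX_PER_HOST is an assumed tuning knob).
MAX_PER_HOST = 2
host_slots = defaultdict(lambda: threading.BoundedSemaphore(MAX_PER_HOST))

def fetch(url):
    host = urlparse(url).netloc
    with host_slots[host]:
        pass  # a real crawler would download and parse the page here

print(enqueue("http://example.com/a"))  # True  (new URL)
print(enqueue("http://example.com/a"))  # False (duplicate)
```

With a file-backed database instead of ":memory:", the seen-URL set also survives crawler restarts for free.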
Kenji Noguchi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crawler.py
Type: text/x-python
Size: 2583 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20080404/38da3cb2/attachment-0001.py>