Need a spider library

Walter Dörwald walter at livinglogic.de
Wed Oct 12 08:19:46 EDT 2005


Laszlo Zsolt Nagy wrote:

> [...]
> For example this malformed link:
> 
> <a href="page.html>Sample link</a>
> 
> could be converted to:
> 
> ['page.html','http://samplesite.current_location/page.html','Sample link']

Your options AFAIK are:
* Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/)
* Various implementations of tidy (uTidyLib, mxTidy)
* XIST (http://www.livinglogic.de/Python/xist)

With XIST, code that extracts the above info from an HTML page looks like this:
--------
import sys
from ll import url
from ll.xist import parsers
from ll.xist.ns import html

def links(u):
    node = parsers.parseURL(u, tidy=True, base=None)
    for x in node//html.a:
        yield str(x["href"]), str(u/str(x["href"])), unicode(x)

for data in links(url.URL(sys.argv[1])):
    print data
--------
This outputs something like:

('http://www.python.org/', 'http://www.python.org/', u'\r\n    ')
('http://www.python.org/search/', 'http://www.python.org/search/', u'Search')
('http://www.python.org/download/', 'http://www.python.org/download/', u'Download')
('http://www.python.org/doc/', 'http://www.python.org/doc/', u'Documentation')
...
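
For comparison, here is a rough sketch of the same idea using only the standard library (html.parser and urllib.parse) instead of XIST. The names LinkCollector and links and the example.com URL are made up for illustration. Note that it does not repair markup the way tidy does, so a link as badly broken as the one in your example may well be missed:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, absolute_url, link_text) for every <a href=...>.

    Hypothetical helper, not part of XIST or any other library.
    """

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            absolute = urljoin(self.base_url, self._href)
            self.links.append((self._href, absolute, "".join(self._text)))
            self._href = None


def links(html_text, base_url):
    """Return a list of (href, absolute_url, link_text) tuples."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<p><a href="page.html">Sample link</a></p>'
print(links(sample, "http://www.example.com/dir/index.html"))
# prints [('page.html', 'http://www.example.com/dir/page.html', 'Sample link')]
```

Here urljoin plays the role that u/str(x["href"]) plays in the XIST version: it resolves the link relative to the page's own URL.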

Hope that helps,
    Walter Dörwald


