Need a spider library
Walter Dörwald
walter at livinglogic.de
Wed Oct 12 08:19:46 EDT 2005
Laszlo Zsolt Nagy wrote:
> [...]
> For example this malformed link:
>
> <a href="page.html>Sample link</a>
>
> could be converted to:
>
> ['page.html','http://samplesite.current_location/page.html','Sample link']
Your options AFAIK are:
* Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/)
* Various implementations of tidy (uTidyLib, mxTidy)
* XIST (http://www.livinglogic.de/Python/xist)
With XIST, code that extracts the above info from an HTML page looks like this:
--------
import sys
from ll import url
from ll.xist import parsers
from ll.xist.ns import html
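def links(u):
    # tidy=True runs the page through tidy first, so broken HTML gets repaired
    node = parsers.parseURL(u, tidy=True, base=None)
    # walk the tree and visit every <a> element
    for x in node//html.a:
        # raw href, href resolved against the page URL, link content
        yield str(x["href"]), str(u/str(x["href"])), unicode(x)

for data in links(url.URL(sys.argv[1])):
    print data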
--------
This outputs something like:
('http://www.python.org/', 'http://www.python.org/', u'\r\n ')
('http://www.python.org/search/', 'http://www.python.org/search/', u'Search')
('http://www.python.org/download/', 'http://www.python.org/download/', u'Download')
('http://www.python.org/doc/', 'http://www.python.org/doc/', u'Documentation')
...
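For comparison, the same (href, absolute URL, link text) triples can be sketched with only the standard library. This is not XIST and not one of the options listed above: it is a minimal Python 3 illustration, and the names LinkCollector and extract_links are made up for it. It assumes the page source is already available as a string, and it does not repair broken markup the way tidy does, so a malformed link like the one quoted may not come out as shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, absolute URL, link text) for each <a href=...> element.

    Hypothetical helper for illustration; nested <a> elements are not handled.
    """

    def __init__(self, base):
        super().__init__()
        self.base = base      # URL of the page, used to resolve relative hrefs
        self.links = []       # collected result triples
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            absolute = urljoin(self.base, self._href)
            self.links.append((self._href, absolute, "".join(self._text)))
            self._href = None


def extract_links(page_source, base):
    """Return a list of (href, absolute URL, link text) tuples."""
    collector = LinkCollector(base)
    collector.feed(page_source)
    collector.close()
    return collector.links
```

Calling extract_links(source, "http://www.example.com/dir/index.html") on a page containing a well-formed link to page.html would resolve it against that base URL. For real-world pages with broken HTML, one of the repairing parsers listed above is still the safer choice.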
Hope that helps,
Walter Dörwald