Python web client anyone?
Richard Jones
richard at bizarsoftware.com.au
Sun Oct 14 20:57:42 EDT 2001
On Monday 15 October 2001 10:25, Paul Rubin wrote:
> Richard Jones <richard at bizarsoftware.com.au> writes:
> > > Thanks, this appears to include an HTTP client, which is a start, but
> > > I was looking for something that actually parses the HTML on the
> > > retrieved page like LWP does. I wonder if there's some way to do that
> > > with the XML libraries (though HTML is generally not well-formed
> > > XML--for example it usually has unterminated <P> tags). Any
> > > thoughts?
> >
> > htmllib?
> >
> > If you want quick and simple DOM extraction, I have a module that extends
> > HTMLParser...
>
> Perl LWP is a module for easily writing robot web clients.
There's been a lot of work done with Python robots in the past. Search the
web. For a good laff, read:
http://www.w3.org/Tools/Python/Overview.html
Specifically: "
I have written a robot that does this, except it doesn't check for
valid SGML -- it just tries to map out the entire web. I believe I
found roughly 50 or 60 different sites (this was maybe 2 months ago --
I'm sorry, I didn't save the output). It took the robot about half a
day (a saturday morning) to complete.
"
The demo Guido mentions is no longer distributed with Python. There is a web
tool in the Tools/webchecker directory of the distribution though.
> It doesn't
> exactly make a DOM, but it's the same idea, so DOM extraction would be
> fine. What I *really* want is to be able to easily find link objects
> (anchor tags) based on the anchor text, which LWP for some reason
> doesn't do, but DOM extraction would be a start. By "anchor text" I
> mean the text in <a href=blah.html>this is the anchor text</a>. The
> client should be able to find some "underlined" text on the page it
> retrieves, and "click" on the linked document.
My SimpleDOM module will do this. Here's some off-the-top-of-my-head usage:
>>> import SimpleDOM, urllib
>>> page = urllib.urlopen('http://www.w3c.org/').read()
>>> dom = SimpleDOM.SimpleDOMParser()
>>> dom.parseString(page)
>>> len(dom.getByName('a'))
144
>>> dom.getByName('a')[0]
<SimpleDOMNode "a" {'title': 'W3C Activities', 'class': 'bannerLink', 'href':
'Consortium/Activities'} (1 elements)>
Note that the parser is not particularly generous in accepting bad input.
Before the example above worked, I fed it two pages that it rejected:
1. www.python.org (mismatched tag)
2. www.bizarsoftware.com.au (*blush* bad entity)
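For the anchor-text lookup itself, here is a minimal sketch that does not depend on SimpleDOM. It assumes a current Python 3 and uses only the standard library (html.parser and urllib.parse). The names AnchorCollector and find_link are hypothetical, invented for this illustration. Unlike a strict parser, html.parser tolerates unterminated <P> tags and unquoted attribute values, so it should cope with the kind of page described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML page.

    Hypothetical helper for illustration; not part of SimpleDOM.
    """

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so line breaks don't matter.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def find_link(html, anchor_text, base_url=""):
    """Return the URL of the first link whose text matches, else None.

    Matching is case-insensitive; the href is resolved against base_url.
    """
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    for text, href in parser.links:
        if text.lower() == anchor_text.lower():
            return urljoin(base_url, href)
    return None
```

To "click" the link, pass the returned URL to urllib.request.urlopen and feed the result back into find_link. Exact, case-insensitive matching is just one choice here; a substring or regular-expression match may suit a robot better.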
> I may not have read the htmllib docs carefully enough but it looks more
> intended for formatting/displaying HTML than parsing it. Are your
> DOM extensions available?
http://bigboy.bizarsoftware.com.au/~richard/SimpleDOM.py
Richard